(Warning: almost none of the code in this post is tested.) In C, I miss the kind of higher-order programming I can do in languages like Ruby and Python, but in its usual form, that kind of programming seems to require garbage collection (for closures), and that’s not really compatible with running on a microcontroller. But I've thought up some relatively simple enhancements that could substantially increase the power of C-like languages for microcontroller-like targets.
I was thinking about the problems of stack overflow that I was [complaining about on the Arduino back in January][0], and it turns out John Regehr and the TinyOS folks have done some [interesting work on bounding stack sizes through static analysis][stacktool]. (Thanks for telling me, Brandon!) You feed an AVR executable to their analyzer, and it tells you what its maximum stack depth is, or the maximum stack depth of each thread. This is **awesome**.

[0]: http://lists.canonical.org/pipermail/kragen-tol/2012-January/000943.html
[stacktool]: http://www.cs.utah.edu/~regehr/stacktool/

Stack overflows
---------------

This is important in the first place because, if your stack gets big enough, it collides with other data, such as the heap; then writing to the stack corrupts the other data, and writing to the other data modifies the stack, very likely crashing your program in a mysterious way.

This basically never happens in 32-bit operating systems because, first, the gap between the stack and the heap is gigabytes in size, and, second, it has other write-protected stuff in it, so the stack collides with that stuff first, and you get a very non-mysterious error report: a segfault with, typically, a stack hundreds of thousands of calls deep. Or, occasionally, just one, as Yossi Kreinin hilariously explains in his [The C++ Sucks Series: Petrifying Functions][1].

[1]: http://www.yosefk.com/blog/the-c-sucks-series-petrifying-functions.html

But it does happen regularly on embedded microcontrollers.

Stack analysis and interrupts
-----------------------------

A lot of their work has to do with interrupts: when an interrupt fires, it pushes the current program counter (and, on some architectures, flags) onto the stack. Then the interrupt handler itself may push more state onto the stack, and if it re-enables interrupts, it may become a reëntrant interrupt handler, dramatically complicating the job of stack analysis. This requires some explanation.
Platforms like the PDP-11 (I think) allow a higher-priority interrupt to preempt the handler for a lower-priority interrupt, so that the maximum handling latency for an interrupt depends only on its own handler and the higher-priority interrupts. Platforms like the x86 and AVR don’t have multiple priority levels for executing interrupt handlers; they just have an "interrupts enabled" flag. So you have the temptation to make your interrupt handler reëntrant by reënabling interrupts before the handler is done, in order to reduce the latency for higher-priority interrupts --- but then the same original interrupt could re-occur and make your stack deeper without limit.

Stacked interrupt-handler data is particularly a problem in a system with a lot of different threads, each of which has a small stack, since the interrupt handlers could stack up on any of the threads. This means that the maximum interrupt-handler stack depth gets added to the maximum depth of each thread stack. A solution to this is for the interrupt handler to switch to the main stack before pushing anything else on the stack, so that the cost to each thread stack is only one word of extra maximum depth.

So that means that if you have, say, a thread that just endlessly loops in a leaf function with three or four local variables, you could figure that out at compile time and **give that thread a stack space of, say, 10 bytes**. Which is to say that the entire run-time state of the thread could comfortably and safely be a local variable of another function. Or you could have a global array of them.

Python-style generators
-----------------------

Anyway, though, I was thinking about Python generators. Python generators are a limited form of coroutine, which, like [Adam Dunkels’s protothreads][2], can only yield control to another thread directly, not from within some other function they’ve called.
So, although their own local bindings can’t be kept on a call stack (because some function can instantiate a generator and return it), functions they call could use the main stack, since those functions must return before the generator has a chance to yield control back to another generator or the main thread.

[2]: http://www.sics.se/~adam/pt/

I’ve written before about how [generators simplify problems like Peter Naur’s telegram problem][3], although [Dave Long demurred][4], but I haven’t tried to give an explanation of why that is.

[3]: http://lists.canonical.org/pipermail/kragen-hacks/2010-September/000514.html
[4]: http://lists.canonical.org/pipermail/kragen-discuss/2010-September/001144.html

Consider these three Python generators:

    def each_slice(n, items):
        buf = []
        for item in items:
            buf.append(item)
            if len(buf) + 1 > n:
                yield buf
                buf = []
        yield buf

    def words(file, pattern='[a-zA-Z]+'):
        for line in file:
            for word in re.findall(pattern, line):
                yield word

    def ngrams(n, words):
        queue = []
        for word in words:
            queue.append(word)
            if len(queue) == n:
                yield tuple(queue)  # Lists aren’t hashable.
                queue.pop(0)

The above allows you to write expressions like the following:

    ngrams(2, words(open('corpus')))
    each_slice(2, words(open('corpus')))
    ngrams(3, (word.lower() for word in words(open('corpus'))))

Now, you could also write ngrams() or each_slice() to eagerly consume an already-computed input array, and fill up and return an array of their results, at a substantial cost in promptness, storage space, locality of reference, and (if you don’t have garbage collection) memory-management complexity.
So, for efficiency, you could write each of the expressions above as an explicit loop, with the various stages fused together; for example, the third one becomes:

    file = open('corpus')
    queue = []
    for line in file:
        for word in re.findall('[a-zA-Z]+', line):
            word = word.lower()
            queue.append(word)
            if len(queue) == 3:
                use(tuple(queue))
                queue.pop(0)

I hope, however, that you will agree that this is considerably harder to read and change than the one-line expression above, which is to say that the parts of the algorithm named as `ngrams` and `words` are mixed together and therefore not reusable.

The way that traditional APL and NumPy work is to be fully eager, as I described before; if you write an expression like:

    (peak * sin(arange(n_samples) * (2*pi * hz / rate))).astype(Int16)

then Numeric first allocates an array for the output of `arange` and fills it up, then multiplies each number by `(2*pi*hz/rate)` (in a new array), then computes the sine of each number (in a new array), then multiplies each number by `peak` (in a new array), then converts the result to 16-bit integers (in a new array), making a total of five arrays. For sufficiently large values of `n_samples`, it would certainly be more efficient to do this computation one sample at a time, although the optimal chunk size is probably somewhere in the middle. (I'm obviously no expert here, but you might want to compute at least one SIMD register’s worth of results, and probably several, before you go haring off after function pointers and branch target buffers and mispredicted branches and whatnot.)
You could regain that kind of separation of concerns, without the generator or coroutine facility, by constructing explicit iterator objects with `next`; here I use closures to avoid the syntactic tax of Python classes, but I assume that the iteration is terminated by a StopIteration exception, which it’s okay to simply pass along to our caller, as with the normal Python iteration protocol:

    def words(fnext, pattern='[a-zA-Z]+'):
        buf = []
        def next():
            while not buf:
                buf.extend(re.findall(pattern, fnext()))
            return buf.pop(0)
        return next

    def ngrams(n, wnext):
        queue = []
        def next():
            while len(queue) < n:
                queue.append(wnext())
            rv = tuple(queue)
            queue.pop(0)
            return rv
        return next

(Note that `words` must mutate `buf` in place with `extend` rather than rebinding it, since a Python closure can’t rebind a variable in an enclosing scope without Python 3’s `nonlocal`, and it must pop from the front of `buf` to preserve word order.)

I’ve done some work here to restructure the code to avoid explicit “state” variables that encode what was previously implicit in the program counter, but it seems to me that the code is more complicated anyway. Letting each generator track its own state with its program counter, rather than having to use explicit variables, allows us to use sequence, alternation, and iteration rather than explicit state-transition functions. Recasting `each_slice` without generators is a little hairier:

    def each_slice(n, items):
        def next():
            if next.done:
                raise StopIteration
            buf = []
            while len(buf) + 1 <= n:
                try:
                    buf.append(items.next())
                except StopIteration:
                    next.done = True
                    return buf
            return buf
        next.done = False
        return next

(Here the `done` flag lives in an attribute on `next` itself, again to dodge the closure-rebinding restriction.)

Python generators are pull-driven (“external iterators”), but that's irrelevant; the benefit of coroutines applies equally to push-driven (“internal”) iterators. Here's `ngrams` as a push-driven iterator; note that it is almost exactly the same as before:

    def ngrams(n, output):
        queue = []
        while True:
            queue.append((yield))
            if len(queue) == n:
                output.send(tuple(queue))
                queue.pop(0)

Python-style generators in an extended C
----------------------------------------

So Python-style generators are a very useful language feature. Are they compatible with being added to C, or a language with similar sensibilities?
In particular, are they still useful without closures and garbage collection? Yes; here I sketch an outline of how.

Each generator routine would be a data type, like a struct or union, of some computable and possibly large size. The value returned from the generator each time would be returned in the same way as a normal function return value --- typically in registers for small return values, or in a fixed spot on the stack or via a hidden pointer argument for larger ones --- and similarly for values passed in to the generator upon initial invocation or resumption. (If you’re trying to implement this yourself in C, the hidden pointer may not be so hidden. The key point is that it doesn't make sense for the iterator to allocate memory for a large return value; its client should do that.)

Perhaps it would be nice to automatically (statically or at allocation time) initialize variables of generator type when they are created, so that they contain the proper start address for the first resumption; but if you’re allocating them on the heap, you probably need some kind of operation to do the initialization explicitly, and in any case, generators that take arguments are very useful. So perhaps the syntax should look something like this:

    int generator range(int start, int stop, int step) {
        for (int ii = start; ii < stop; ii += step) yield(ii);
    }

    // ... and here we create a temporary variable on the stack.
    for (int ii: range(0, 10, 1)) printf("%d\n", ii);

    // But we can also create a variable:
    range rr;

    // Later initialize it. There is no problem using function-call
    // syntax for this, since C can already do function-call syntax
    // with some types of values, namely functions and function
    // pointers. It just needs a rule for how to apply it to a new
    // type.
    rr(0, 10, 1);

    // And later still use it:
    for (int ii: rr) printf("%d\n", ii);

Since they will normally all be invoked from the same place, all of the yield points in the generator need to take the same set of arguments, if they can take arguments, and all that are not the first entry point need to return the same type, as well. (Or types, if you’re wandering a bit further afield from C.)

When the generator finally does finish, it needs a way to tell its caller that it’s finished and should not be resumed again. The main alternatives are:

- A second return value, or CPU flag. This is easy to achieve at the machine-code level.

- Twiddling the return address; for example, you could add the length of a short jump instruction to it, avoiding the need for a conditional jump in the caller, but it still needs an unconditional jump in the caller.

- A `longjmp()` (or equivalent) to the end of the loop, using a `jmp_buf` or equivalent passed as a hidden parameter.

There’s also the question of how to *write* a wider variety of invocations than `for (foo: somegen) { ... }`. I mean, to do Python’s:

    try:
        x = somegen.next()
    except StopIteration:
        y()
    else:
        z(x)

you could do:

    sometype x;
    for (x: somegen) goto ok;
    y();
    if (0) { ok: z(x); }

but that seems rather obscure! A reasonable, if not very C-like, syntax might be

    z(next(somegen) { y(); goto somewhere; });

That’s only reasonable because you almost always want to handle the end of an iterator using some kind of nonlocal transfer of control, even if only a `break;`.

### Generator objects with just the local variables ###

How big is this in-memory generator object? One approach would be to make the in-memory generator object contain just the local variables, program counter, and perhaps temporary variables of the generator routine, and never to point the stack pointer to it. (Unless it happens to be the top thing in the stack frame of some other function, that is.)
In this case, the generator’s local-variable references will need to indirect through a context pointer; this is basically the same thing GCC does with nested functions, and if you want to make the generator resumption look like a normal function call to the caller, you could use the same hack GCC uses for that --- as Norm Hardy showed me one day, GCC wedges this into C by generating a two- or three-instruction trampoline on the stack that loads the context pointer and then jumps to the function.

In that case, the generator code itself can execute its “yield” by stuffing the yielded value into the appropriate place and calling a yield routine with the address of its generator object. The yield routine is something like `savu`/`aretu` in V6 Unix: it sticks what ought to be its return address into the generator object it’s passed, and then pops it off the stack before returning --- to the callsite that had last resumed the generator.

### Generator objects containing entire thread stacks ###

Lua has a more powerful coroutine facility: its coroutines have entire stacks of their own, so their callees, or callees' callees, can yield control on their behalf --- so a Lua coroutine is essentially a whole thread, but one that only runs when you pass control to it explicitly. Could you do that in C? `swapcontext()` does more or less that already, but with the usual risks of stack overflows.

To get rid of that risk, instead of having a special “generator” kind of function, with an icky extra indirection on its local-variable accesses, you’d have a “coroutine stack for” type-constructor that could be applied to any stack-depth-bounded function, like the existing type-constructors “pointer to” or “array of ints”, whose size was the statically-computed maximum stack depth for a particular function.
Providing such types would be really hard to do with a normal C toolchain, since their size would depend on the entire library you were linking against --- the maximum stack depths and nonrecursivity of the nonrecursive library functions would become part of their ABI! --- but it would be entirely reasonable to implement for embedded targets where the whole linked program might be 16KiB. It does, however, require compiling the entire transitive call tree below a function qwertz in order to figure out how big “coroutine stack for qwertz” was going to be, so multipass compilers, as well as separate compilation, would be pretty much out.

Then the “yield” routine would be more like a normal task switch, like `swapcontext`, simply switching stacks and returning. This approach should make it possible to run dozens, if not hundreds, of cooperative threads, each with a full-fledged stack, within a microcontroller with 4KiB of RAM.

Ruby’s iterator mechanism, and implementing it in a C-like language
-------------------------------------------------------------------

Ruby’s iterators are almost, but not quite, completely different from Python iterators. It occurred to me that Ruby’s iterator mechanism, which I think originally came from CLU, should *also* be practical to implement in a language for embedded programming --- in particular, it doesn’t require garbage collection, but it is much more convenient. I think it supports higher-order composition of iterator transformations to more or less the same extent as Python's mechanism, but I'm still not quite sure.

So, what are Ruby iterators? Ruby has several mechanisms for passing around procedures, which is to say writing higher-order functions, which have the appearance of having been added one by one over time.
The simplest one is remarkably effective and remarkably limited:

    def each start, stop, step
      i = start
      while i < stop
        yield i
        i += step
      end
    end

    each 1, 10, 2 do |i|
      puts i
    end

So here we’ve extended the language with a looping construct which takes a sort of lambda (called, in Ruby, a block) as an argument, but *without reifying the block as a first-class value*. Passing arguments to the block is done with a special-purpose `yield` construct, which in Ruby is an expression. `yield` by itself means to invoke the block with no arguments. So there is, by construction, no way to pass the block anywhere else, or do anything other than just invoke it. (Other Ruby mechanisms do allow you to do a wider range of things.) I think this design comes from CLU iterators originally.

That doesn’t prevent you from doing the usual downward-funarg things with it; you just have to eta-expand them:

    def dotimes n
      each 0, n, 1 do |i|
        yield i
      end
    end

It does, of course, prevent you from returning the block or storing it into a data structure, but in a much more elegant way than Pascal’s downward-funarg mechanism. It also prevents the block from being recursive, that is, directly or indirectly calling itself.

You can use this block construct for more than just iteration; for example, here it provides a comparator for a sort routine:

    # (yield a, b) <= 0 whenever a is metaphorically less than or equal to b.
    def insertion_sort! array
      (0...array.length).each do |nsorted|
        # Find latest point to insert next value:
        pos, to_insert = nsorted, array[nsorted]
        (nsorted-1).downto 0 do |ii|
          if (yield array[ii], to_insert) <= 0
            break
          end
          pos = ii
        end
        # Move later items over to make space, then insert it.
        (nsorted-1).downto pos do |ii|
          array[ii+1] = array[ii]
        end
        array[pos] = to_insert
      end
    end

    string_array = ["yes", "YES", "no", "Yes!", "No!"]
    insertion_sort!(string_array) do |a, b|
      a.downcase <=> b.downcase
    end
    p string_array

Traditionally, however, methods like `insertion_sort!` above are still called “iterators”, even though this is somewhat imprecise.

### Efficient compilation of iterators without a heap ###

It occurred to me that you can also *compile* this in a simpler way than Pascal’s general downward-funarg mechanism, which GCC also implements for C, albeit without the type-system restrictions that make it safe in Pascal.

C’s function-pointer mechanism is very simple to implement. A function pointer is a single word, and can be invoked (without arguments or caller-saves, anyway) with a single instruction:

    call *%ecx          # and call the function pointed to

Pascal’s mechanism, by contrast, requires two words --- the function pointer and a context pointer:

    struct fp { int (*fp)(); void *cp; };

The minimal implementation of the call is therefore two instructions, plus an additional instruction either inside or outside the procedure; here it is shown outside:

    mov (%eax), %ecx    # load function pointer
    mov 4(%eax), %edx   # load context pointer
    call *%ecx

And then of course you have to allocate memory for this two-word structure and initialize it. (As I mentioned above, GCC provides this facility for C, too.) Then, inside the nested function, some local-variable references indirect through the context pointer, and some do not. This means that some variable references are slower, and the compiler is more complicated.
It occurred to me that, under specific circumstances, you can implement Ruby-like blocks more simply. Later I realized that this was something I had learned and forgotten from Raphael Finkel’s book, “Advanced Programming Language Design”, although I think I may be explaining a technique here that isn’t in Finkel.

The key fact is that, since the block cannot be recursive, it doesn’t need a stack frame of its own for its parameters or local variables. It can simply use storage carved out of its caller’s stack frame. This means that it needs only one context pointer --- the usual stack frame pointer --- not two. And that means that variable references inside the block can be compiled the same as those outside of it. (None of the following applies to SPARC or similar register-window architectures.)

### With a separate frame pointer ###

In runtimes that have a separate frame pointer and stack pointer, and where functions store their parameters in their stack frames, this is particularly easy to arrange. The block is represented as a simple code pointer, as in C. The block is invoked by:

1. pushing the iterator’s frame pointer onto the stack;
2. pushing arguments to the block onto the stack or into registers;
3. loading the address of the block code;
4. restoring the iterator’s caller’s frame pointer (which the iterator saved on the stack much earlier);
5. calling the block code, pushing a return address onto the stack, or wherever your architecture likes to keep it.
For example:

    push %ebp           # save frame pointer
    mov -8(%ebp), %eax  # block parameter 1
    mov -16(%ebp), %ecx # block parameter 2
    mov -20(%ebp), %edx # code address of block
    mov 4(%ebp), %ebp   # restore caller’s frame pointer
    call *%edx          # invoke block
    pop %ebp

In the zero-argument case, if we already have the block in a register, with no caller-saves, this reduces to:

    push %ebp           # save frame pointer
    mov 4(%ebp), %ebp   # restore caller’s frame pointer
    call *%edx          # invoke block
    pop %ebp

Upon entry to the block, the frame pointer is already set up to point to the stack frame that the block belongs to, so access to local variables is simply by indexing off the frame pointer. (And the block doesn’t need the usual function prologue and epilogue of `push %ebp; mov %esp, %ebp` and `pop %ebp`.) Block parameters are accessible by indexing off the stack pointer, or are in registers. The block returns control to the iterator with a single `ret` instruction, and either the `ret` or the caller pops the parameters, if they were pushed on the stack instead of passed in registers.

This allows the block to call further functions with abandon; their stack frames will be pushed on top of the stubby stack frame containing the block’s return address.

However, nonlocal control flow from the block requires care, as do nested blocks. If the block executes a `goto`, `break`, or `return` that leaves the block, the stack pointer must first be adjusted to deallocate the iterator’s frame. In some languages, you also need to run cleanups such as `finally`, `unwind-protect`, `ensure`, or C++ destructors contained in the iterator. But, absent cleanups, this can be handled by storing the desired stack pointer in the frame itself; it can then be restored with a single indexed load:

    mov 28(%ebp), %esp

(`return` in Ruby, like `^` in Smalltalk, exits the function containing the block, not just the block itself. Ruby has a `next`, which exits just the block, returning a value to the iterator.)
Nested blocks are slightly tricky because they may want to access the parameters of their outer block; for that to work, the outer block must store its parameters into local variables in the stack frame. Although it's probably not worth it, this could be made simpler by giving the iterator a pointer to store those parameters at; instead of just passing a code address, we pass, Pascal-like, a pointer to a structure containing the block’s code address and its parameters, such as:

    struct twoargblock {
        void (*cp)();
        int param1;
        int param2;
    };

Then the sequence to invoke the block becomes something like this instead:

    push %ebp           # save frame pointer
    mov -20(%ebp), %edx # address of block structure
    mov -8(%ebp), %ecx  # block parameter 1
    mov %ecx, 4(%edx)
    mov -16(%ebp), %ecx # block parameter 2
    mov %ecx, 8(%edx)
    mov 4(%ebp), %ebp   # restore caller’s frame pointer
    mov (%edx), %edx    # load code address
    call *%edx          # invoke block

This probably simplifies the compiler, but, particularly since most blocks don’t have other blocks nested inside of them, it’s probably a net loss for performance. Even blocks that contain nested blocks may be better off simply storing their parameters into slots in the stack frame on entry:

    mov %eax, -40(%ebp)
    mov %ebx, -44(%ebp)

...because that avoids the need to allocate space for, initialize, and fetch the field containing the block’s code address.

### Without a separate frame pointer ###

If you access your local variables by indexing off the stack pointer, the situation is considerably hairier. It would seem that the iterator can’t simply restore its caller’s stack pointer before invoking the block, because then, if the block calls any functions, those functions’ stack frames will overwrite the iterator’s own stack frame, causing a crash. But there’s a solution, as found in Finkel’s book, which can even be implemented unreliably, with no compiler support, in pure C.
You just have to locate the iterator’s stack frame far enough away from its caller’s stack frame that it won’t get overwritten! Finkel’s solution, IIRC, is to stack-allocate a big unused array in a trampoline function, and then have the iterator eventually return to its caller by longjmp(), avoiding the trashed return address of the trampoline function (which probably points into the block). The iterator also uses longjmp() to invoke the block, and the block uses longjmp() to jump back down the stack into the iterator each time it finishes.

If the compiler can statically bound the stack depth of the block, you can do slightly better. Before calling the iterator, the caller allocates enough stack space for the deepest possible call stack that the block might invoke, and passes the iterator both the block’s code address and the stack pointer to use to call the block with. Then the iterator’s invocation of the block looks like this:

    mov %esp, %eax  # save iterator’s stack pointer
    mov %ecx, %esp  # restore block’s stack pointer
    push %eax
    call *%edx      # invoke block
    pop %esp        # restore iterator’s stack pointer

With this approach, you need only pop the single-word block return address from the stack before a nonlocal transfer of control out of the block.

-- 
To unsubscribe: http://lists.canonical.org/mailman/listinfo/kragen-tol