As Peter has pointed out, our stackwalk code is rather slow. The code that's in there was my first-attempt at the need for stack walking code. There's one optimization in place, but the algorithm behind the optimization could use some work.
Basically, it finds the min and max values of all headers. It does a check (for quick failure purposes) to see if the data on the stack is in that range, and then proceeds to do the accurate check. The accurate check consists of walking through each header pool in an attempt to see if this pointer-sized data could be interpreted as being in that pool. Currently, this is a linear walk over the header pools. I imagine there are many better algorithms for determing a root set from a stack. The boehm collector probably has decent code in this regard. However, given that we have O(N) with size of stack, I'm not sure how we'll be able to alleviate this in the long run. Anyone feeling adventuresome and want to attempt to speed this up? It should be an easy introduction to the GC code in general. Just start out in trace_system_stack, and work your way down. Mike Lambert