Good point.

I was thinking of a way to keep the immediate-destruct semantic:
Threads could do a "micro gc" on their own thread local data on each
evaluator callback call (i.e. a bit like the current
destruct_objects_to_destruct calls). These micro gcs would have to
run very quickly though, which probably rules out the generational gc
approach with mark-and-sweep for young data (refcount-based garbing
remains basically equally efficient regardless of how often it runs,
while mark-and-sweep does not).
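To make that cost difference concrete, here's a minimal C sketch of why refcount-based garbing scales with how often it runs: dropping the last ref is an O(1) operation that triggers the destruct immediately, so doing it on every evaluator callback costs no more in total than batching it, whereas a mark-and-sweep pass must traverse the live young set each time it runs. (The struct layout and names are hypothetical simplifications, not Pike's actual object representation.)

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical simplified object header; Pike's real object layout differs. */
struct obj {
    int refs;
    void (*destruct)(struct obj *);  /* immediate-destruct callback */
};

static int destructed = 0;
static void count_destruct(struct obj *o) { (void)o; destructed++; }

/* Dropping a ref is O(1) whenever it happens, so running this on every
 * evaluator callback is no more expensive overall than running it rarely.
 * A mark-and-sweep pass, by contrast, pays a traversal of the live young
 * set on each run, so frequent micro gcs would multiply that cost. */
static void dec_ref(struct obj *o) {
    if (--o->refs == 0) {
        o->destruct(o);  /* the destruct hook runs right here, immediately */
        free(o);
    }
}

static struct obj *make_obj(void) {
    struct obj *o = malloc(sizeof *o);
    o->refs = 1;
    o->destruct = count_destruct;
    return o;
}
```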

This would mean that the immediate-destruct semantic works as long as
the data is thread local, which is true in the mutex-lock-on-the-stack
scenario. It's also true in most cases when e.g. arrays are built on
the stack using +=. However, cases like

  my_map[key] += ({another_value});

would not be destructive on the array values if my_map is shared. But
that case is hopeless anyway in a multi-cpu world, since the array
value can always get refs from other threads asynchronously
(prohibiting that would require locking, which would be much worse).
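For reference, the single-ref destructive optimization boils down to a copy-on-write check like the following C sketch (a hypothetical layout, not Pike's actual struct array; growth and locking are ignored). The point is that the refs == 1 test is only trustworthy while the value is provably thread local; once other threads can take refs asynchronously, the shared branch has to be taken:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical array representation; Pike's struct array differs. */
struct array {
    int refs;
    size_t size;
    int items[8];
};

static struct array *make_array(int refs) {
    struct array *a = calloc(1, sizeof *a);
    a->refs = refs;
    return a;
}

/* += on an array value: if we hold the only ref, append in place
 * (destructive); otherwise some other variable or thread may see the
 * value, so we must copy. If refs can change asynchronously from
 * other threads, this check is no longer reliable without locking. */
static struct array *array_append(struct array *a, int v) {
    if (a->refs == 1) {
        a->items[a->size++] = v;  /* destructive: we own the only ref */
        return a;
    }
    struct array *b = malloc(sizeof *b);  /* shared: copy-on-write */
    memcpy(b, a, sizeof *a);
    b->refs = 1;
    a->refs--;
    b->items[b->size++] = v;
    return b;
}
```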

To allow destructive updates in such cases, it'd be necessary to
introduce some kind of new construct so that the pike programmer
explicitly can allow (but not always expect) destructive updates
regardless of extra refs.

However, ditching mark-and-sweep for young data comes at a cost. The
paper I linked to measured that a purely refcounting collector is
20-30% slower when the number of concurrent threads gets above 3 on a
4 cpu box (see pages 80-81). This slowdown is measured over the total
throughput of a benchmark, so it's not just "the gc itself".

Note that this is a comparison between two gc's where the only
difference is the mark-and-sweep for young data - the purely
refcounting collector in this case is still a whole lot more efficient
than the current one in pike due to the drastically lowered refcount
update frequency. I haven't seen any comparisons between the
delayed-update refcount gc and an immediate-update one like pike
currently uses, but I suspect that the difference is substantial there
already.

So the option we're considering here is keeping the immediate-destruct
semantic and some single-ref destructive optimizations, at a cost of
(conservatively speaking) at least 15% overall performance in
high-concurrency server apps. I don't think the single-ref destructive
optimizations can outweigh that performance hit (and in the longer run
they can be achieved anyway with new language constructs). Still,
keeping the immediate-destruct semantic is worth something from a
compatibility view.