On Sat, 21 Apr 2012, Martin Stjernholm, Roxen IS @ Pike developers forum wrote:

/.../ The completely inlinable macros the old allocator is using
seem quite hard to beat.

Is there a big difference? Tobias' comparison with the Pike benchmarks
appears to show some speedups and some slowdowns in various tests,
which I attribute mostly to noise. I couldn't see any obvious trend
that it's slower.

There are a few tests which suffer noticeably (e.g. AppendMapping(2) and
some others) and I have no idea what the reason is. Most of the tests
look the same or faster with the new one.

Actually, I thought there would be a win from not inlining as much as
Pike currently does - it's better to fit more code into the CPU caches
than to inline some constants. I'm keen on reducing the macros and
inlining in Pike in general.

Yes, true. It depends on the code surrounding it. In those micro
benchmarks inlining is clearly a win. Inside the Pike source I don't
know. I guess the 'noise' we see is somehow related to that as well. And
-O3 doesn't make inlining very predictable.

I see no problem however, merging it back into the tree. /.../

That'd be great, I think. As I said, I'm in favor, and nobody has
spoken against it so far. It'd be an experimental feature in the
beginning though - we usually add a configure switch to be able to
switch back to the old implementation.

yes, that sounds good.

Do you have any plans for switching to object-local storage? /.../

It would be worthwhile to test the overhead when only using few
blocks. The allocator tries to make initialization as cheap as
possible, so it might not even hurt performance much. On the other
hand, memory overhead could be quite high.

Yes, allocating a full block for every object would not be good,
unless the prospective use cases are only a few very large trees, but
I take it this is intended as a general-use data structure?

I think the ideal memory management would be to use a realloc
double-the-size strategy for small trees, and switch to a block
allocator when they reach the size of a memory page. That's more
complex, of course.
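The hybrid strategy could be sketched roughly like this (all names and the page-size constant are illustrative assumptions, not code from Pike): a small tree lives in one buffer that doubles via realloc, and once growth would exceed a page, the caller is expected to hand the tree over to the block allocator instead.

```c
#include <stdlib.h>

#define SMALL_LIMIT 4096  /* assumed page size: the switch-over point */

/* Hypothetical buffer holding the nodes of a small tree. */
struct node_buf {
    size_t size;   /* bytes currently allocated */
    size_t used;   /* bytes in use, maintained by the caller */
    char *data;
};

/* Returns 1 on success; 0 when the tree has outgrown one page and
   should migrate to the block allocator (or on allocation failure). */
static int buf_reserve(struct node_buf *b, size_t need)
{
    if (b->used + need <= b->size)
        return 1;
    size_t ns = b->size ? b->size : 16;
    while (ns < b->used + need)
        ns *= 2;                      /* double-the-size growth */
    if (ns > SMALL_LIMIT)
        return 0;                     /* let the block allocator take over */
    char *p = realloc(b->data, ns);
    if (!p)
        return 0;
    b->data = p;
    b->size = ns;
    return 1;
}
```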

Interesting idea. So then one would grow the allocation until it reaches
the allocator's page size and let the allocator handle it from then on?

We had some plans to make a thread safe allocator out of it,
allocating blocks from thread local pages, etc. However, it never
got any further than some thinking.

It's easy enough to let nodes start out in a thread local area, but
how do you keep them that way when you share the data structure
between threads? You'd have to implement some sort of migration
heuristic so that nodes can move to the thread local area of another
thread, and that still wouldn't work well if several threads are
manipulating the same large tree.

The allocator page would be local in the sense that only one thread is
serving allocations from it. Deallocations could happen from all
threads, so some synchronization is necessary there. I was hoping that
a CAS would be enough.
It's not useful for Pike, because of the global lock. We were just
pondering how one could go about making it thread safe without
locks.
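The single-CAS idea could look roughly like this sketch (illustrative names, not the actual allocator): any thread pushes a freed block onto a lock-free list with one compare-and-swap, while only the owning thread drains the list. Since foreign threads only ever push, the classic ABA problem of lock-free pops does not arise here.

```c
#include <stdatomic.h>
#include <stddef.h>

struct block { struct block *next; };

struct page {
    /* Shared free list: CAS-pushed by any thread, drained by the owner. */
    _Atomic(struct block *) free_list;
};

/* Deallocation path, callable from any thread. */
static void page_free(struct page *pg, struct block *b)
{
    struct block *head = atomic_load(&pg->free_list);
    do {
        b->next = head;  /* head is re-read by the CAS on failure */
    } while (!atomic_compare_exchange_weak(&pg->free_list, &head, b));
}

/* Owner-only: take the whole list in one atomic swap and recycle it. */
static struct block *page_drain(struct page *pg)
{
    return atomic_exchange(&pg->free_list, NULL);
}
```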

I can't see how that can be made to work well, so that's why I think
it's better to opt for locality-by-relation, i.e. that nodes in the
same (sub)tree tend to be kept close to each other. Then trees which
are used by a single thread only will effectively be in thread local
storage for that thread.

yes, true.

When multisets are reallocated, their nodes are always sorted in
memory. That's not any significant extra work since the operation is
linear anyway, it keeps the nodes of the smallest subtrees right next
to each other, and an iterator will access memory completely linearly.

Interesting, will have a read there. What we did in the CritBit
implementation is much simpler than that. Modifications of the tree
while iterators are active force the iterator to make one lookup
the next time.

I take it you have a generation counter in each object then, so that
the iterator can test whether its direct pointer is still valid?

exactly.
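A minimal illustration of the scheme (using a sorted array as a stand-in for the tree; all names are assumptions, not CritBit code): every mutation bumps a counter, and the iterator keeps a direct pointer plus the counter value it last saw, redoing exactly one lookup when the counter has moved on.

```c
#include <stddef.h>

#define CAP 16

struct tree {
    unsigned generation;   /* bumped on every modification */
    int n;
    int keys[CAP];         /* kept sorted; stands in for the tree nodes */
};

struct iterator {
    struct tree *t;
    unsigned generation;   /* generation at the time 'pos' was taken */
    int *pos;              /* direct pointer: the cheap path */
    int key;               /* enough state to re-find the position */
};

static int *tree_find(struct tree *t, int key)
{
    for (int i = 0; i < t->n; i++)
        if (t->keys[i] == key)
            return &t->keys[i];
    return NULL;
}

static void tree_insert(struct tree *t, int key)
{
    int i = t->n++;
    while (i > 0 && t->keys[i - 1] > key) {
        t->keys[i] = t->keys[i - 1];
        i--;
    }
    t->keys[i] = key;
    t->generation++;       /* invalidates outstanding direct pointers */
}

static int *iter_current(struct iterator *it)
{
    if (it->generation != it->t->generation) {
        it->pos = tree_find(it->t, it->key);   /* the "one lookup" */
        it->generation = it->t->generation;
    }
    return it->pos;
}
```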

It can potentially hurt performance quite a lot, though.

Performance would suffer only when iterators are used a lot in heavily
mutating trees, wouldn't it? That's a fairly peculiar use case imo.
Sounds to me like your simpler approach will do just fine in general.

right, it's probably not the dominant use case.

There are other places where object-local storage might help. Mapping
rehash for instance could be made much simpler, now that we have
power of 2 hashtables (bucket n gets split into buckets n and
n + oldsize when the hashtable grows). However, I don't know of any
easy way to make that work properly with iterators.

Doesn't the generation counter scheme solve that as well?

When the table has not been grown, a generation counter should be enough
to restart from the same bucket; one would have to keep track of all
entries the iterator has already seen in that bucket. If the table
grows or shrinks, it gets complicated.
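The bucket split mentioned above can be sketched as follows (illustrative code, not Pike's mapping implementation): with the usual power-of-two masking, each entry of old bucket n lands in bucket n or bucket n + old_size depending on the next hash bit, so growing redistributes one bucket at a time and never touches unrelated buckets.

```c
#include <stdlib.h>
#include <string.h>

struct entry { struct entry *next; unsigned hash; };

struct table {
    unsigned size;            /* always a power of two */
    struct entry **buckets;
};

/* Doubles the table; splits each old bucket n into n and n + old. */
static int table_grow(struct table *t)
{
    unsigned old = t->size, ns = old * 2;
    struct entry **nb = realloc(t->buckets, ns * sizeof *nb);
    if (!nb)
        return 0;
    memset(nb + old, 0, old * sizeof *nb);
    t->buckets = nb;
    t->size = ns;
    for (unsigned n = 0; n < old; n++) {
        struct entry *e = nb[n];
        struct entry **stay = &nb[n], **move = &nb[n + old];
        *stay = *move = NULL;
        while (e) {
            struct entry *next = e->next;
            /* The newly significant hash bit decides the destination. */
            struct entry **dst = (e->hash & old) ? move : stay;
            e->next = *dst;
            *dst = e;
            e = next;
        }
    }
    return 1;
}
```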