On Sunday 04 November 2001 02:39 pm, Dan Sugalski wrote:
> At 08:32 PM 11/4/2001 +0100, Benoit Cerrina wrote:
> > > There will be a mechanism to register PMCs with the interpreter to note
> > > they're pointed to by something that the interpreter can't reach. (For
> > > example, a structure in your extension code, or via a pointer stashed
> > > in the depths of a buffer object, or referenced by another interpreter)
> > > This "foreign access" registry is considered part of an interpreter's
> > > root set.
> >
> >If this is the case, how do you want to move the PMC, I thought you wanted
> > a copying collector?
>
> While the PMC structures themselves don't move (no real need--they're of
> fixed size so you can't fragment your allocation pool, though it makes
> generational collection easier to some extent) the data pointed to by the
> PMC can. That's the bit that moves.
>

Ok, so far, here's what I see:

There are two main memory segments.  First, a traditional alloc/free region.  
Within this area are arenas (for fixed-size memory objects).  This region 
must be efficient under multithreading, and hopefully not too wasteful or 
prone to fragmentation.  It is mostly for core operation; non-arena 
allocations should be minimized, and memory leaks here are as critical as 
usual.  The utilization patterns should be analyzable, well known, and 
compensated for (with respect to fragmentation, thread contention, etc.).  My 
vmem derivative might still be valuable here, but I suppose we need to let 
the core's needs and characteristics flesh out further.

Then there's a GC region, which is primarily the allocation space of the 
interpreted app, and which obviously can't be trusted with respect to memory 
leaks or usage (fragmentation) patterns.  PMCs and strings are handles into 
this GC region, though the handles themselves are stored in core memory 
(above).

Question: Are there other types of references?  I can't think of any.

The GC algorithm (whether run as a separate thread or called incrementally 
when memory is low but not yet starved) needs to quickly access this GC 
heap's values, which I believe is Dan's reason for requiring a maximum of two 
levels of indirection.  I suppose it just runs down the known PMC lists, 
including the PMC and string register sets, the stacks, and the stashes for 
each interpreter.  You do, however, refer to a "foreign access" region, and 
my first impression is that it's the wrong way of going about it.

First of all, how are arrays of arrays of arrays handled?

/* create a buffer object of PMCs */
PMC_t p1 = new_array(50);

for (int i = 0; i < 50; i++)
    p1->data[i] = new_array(50);

In order to comply with the max depth, the array creation will have to 
register each sub-array entry in the foreign access region, or am I missing 
something?

For one thing, this means that the foreign access data structure will grow 
VERY large when PMC arrays/hashes are prevalent.  What's worse, this data 
structure is stored within the core, which means additional burden on 
core-memory fragmentation and contention.

Additionally, what happens when an array is shared by two threads (and thus 
two interpreters)?  Whose foreign access region is it stored in?  My guess is 
that, to avoid premature freeing, BOTH.  So now a work queue used by a 
30-thread producer/consumer app is allocating and freeing LOTS of core memory 
on each enqueue/dispatch.  Again, with the details this fuzzy, I'm probably 
going off half-cocked; but I did qualify that this was my initial impression.

My suggestion is to not use a foreign references section; or if we do, not to 
utilize it for deep data-structure nesting.  Instead, incorporate a doubly 
linked list within PMCs and strings (a sketch follows below).  Thus whenever 
you allocate a PMC or string, you attach it to the chain of allocated 
handles; whenever the PMC is freed, you detach it.  The GC then has the 
laughably simple task of navigating this linked list, which spans all 
threads.  This can incorporate mark-and-sweep or copying or whatever.  By 
adding 8 or 16 bytes to the size of a PMC/string, you avoid many 
memory-related problems, not to mention the fact that we are freed of the 
concern of depth away from the interpreter root.
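
Here's a minimal sketch of what I have in mind, with the chain links embedded 
directly in the handle.  The field and function names are illustrative only, 
not a proposal for Parrot's actual layout:

typedef struct PMC {
    struct PMC *alloc_prev;   /* previous handle in the allocation chain */
    struct PMC *alloc_next;   /* next handle in the allocation chain */
    unsigned    flags;        /* mark bit, multi-referenced bit, etc. */
    void       *data;         /* pointer into the movable GC heap */
} PMC;

static PMC *alloc_chain_head = NULL;

/* Attach a freshly allocated handle at the head of the chain. */
static void chain_attach(PMC *p) {
    p->alloc_prev = NULL;
    p->alloc_next = alloc_chain_head;
    if (alloc_chain_head)
        alloc_chain_head->alloc_prev = p;
    alloc_chain_head = p;
}

/* Detach a freed handle; O(1) thanks to the back pointer. */
static void chain_detach(PMC *p) {
    if (p->alloc_prev)
        p->alloc_prev->alloc_next = p->alloc_next;
    else
        alloc_chain_head = p->alloc_next;
    if (p->alloc_next)
        p->alloc_next->alloc_prev = p->alloc_prev;
}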

Beyond this, I think I see some problems with not having PMCs relocatable.  
While compacting the object stores that are readily resized can be very 
valuable, the only type of memory leak this avoids is fragmentation-related; 
the PMCs themselves still need to be guarded against memory leaks.  Now I'm 
still in favor of some form of reference counting; I think that in the most 
common case, only one data structure will reference a PMC, and thus when it 
goes away, it should immediately cause the deallocation of the associated 
object space (sacrificing a pittance of run-time CPU so that the GC and free 
memory are relaxed).  But I hear that we're not relying on an integer for 
reference counting (as with perl5), and instead are mostly dependent on the 
GC.  Well, if we use a copying GC but never move the PMC, then how are we 
freeing these PMCs?  It would be possible to additionally utilize 
mark-and-sweep, but then we'd have three separate algorithms: simplified 
reference-counting deallocation (when a multi-referenced flag is not set), 
object-store compaction (to minimize the variable-sized data stored in the 
copying GC's heap), and mark-and-sweep when we're really memory starved and 
don't trust that objects have been freed.
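
To illustrate the first of those, here's a sketch of the one-bit "simplified 
reference counting" I mean.  The flag names and free_object_space() are made 
up for the example:

enum {
    PMC_MULTI_REF = 1 << 0,   /* set once a second reference ever appears */
    PMC_STALE     = 1 << 1,   /* freed eagerly; awaiting detachment by GC */
    PMC_MARKED    = 1 << 2    /* reached during the mark pass */
};

extern void free_object_space(void *data);   /* hypothetical helper */

/* Called when a data structure drops its reference to p. */
void pmc_release(PMC *p) {
    if (!(p->flags & PMC_MULTI_REF)) {
        /* the sole reference just went away: free the object space now */
        free_object_space(p->data);
        p->flags |= PMC_STALE;   /* the GC detaches the handle later */
    }
    /* otherwise, leave it for mark-and-sweep to reclaim */
}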

If at least the last two parts of the algorithm are used, then the GC would 
periodically navigate the list.  On the first pass, it would find stale 
handles (those freed by simplified reference counting) and detach them from 
the list.  Note that if a non-GC thread detached the PMC from the chain, it 
would derail the GC; global locking would be required, which is a bad thing 
(tm).  The first pass also would reset the mark-and-sweep flags.
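
In terms of the sketch above, the first pass might look like this (only the 
GC thread touches the chain here, which is why eager freeing merely sets 
PMC_STALE rather than detaching):

/* First pass: detach stale handles and clear mark bits. */
void gc_pass_prepare(void) {
    PMC *p = alloc_chain_head;
    while (p) {
        PMC *next = p->alloc_next;
        if (p->flags & PMC_STALE)
            chain_detach(p);          /* freed earlier by refcounting */
        else
            p->flags &= ~PMC_MARKED;  /* reset for the mark pass */
        p = next;
    }
}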

The second pass performs the marking.  I believe this one requires navigating 
all the interpreter data structures (stacks, register sets, etc.).  In fact, 
it seems to me that within the allocation chain, only references need be 
visited in the mark stage, so perhaps a second linked list could be provided 
for such references, or an additional 8/16 bytes within each PMC (or within 
the reference-specific data structure).  This would cut down on the mark 
stage's overhead.
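
The marking itself could be as simple as the following.  visit_children() is 
a hypothetical per-class hook that invokes the callback on each PMC the 
container holds:

extern void visit_children(PMC *p, void (*cb)(PMC *));   /* hypothetical */

/* Second pass: mark everything reachable from an interpreter's roots.
   Nesting depth never matters, since each container is reached through
   its own handle -- no foreign access registry required. */
void gc_mark(PMC *p) {
    if (p == NULL || (p->flags & PMC_MARKED))
        return;
    p->flags |= PMC_MARKED;
    visit_children(p, gc_mark);
}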

Lastly, the PMC allocation chain is navigated.  Those PMCs/strings not 
"marked" are freed; those that are marked have their data "copied" into the 
mirror heap so as to consolidate free memory.
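
Continuing the sketch, with copy_to_mirror_heap() standing in for whatever 
the compactor provides:

/* Final pass: sweep the chain, freeing unmarked handles' object space
   and compacting the survivors' data.  The handles themselves never move. */
extern void *copy_to_mirror_heap(void *data);   /* hypothetical */

void gc_pass_sweep(void) {
    PMC *p = alloc_chain_head;
    while (p) {
        PMC *next = p->alloc_next;
        if (p->flags & PMC_MARKED) {
            p->data = copy_to_mirror_heap(p->data);
        } else {
            chain_detach(p);
            if (!(p->flags & PMC_STALE))     /* stale data was freed already */
                free_object_space(p->data);
        }
        p = next;
    }
}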

Even with this, I see some issues:

Firstly, allocating a new PMC requires special care so as not to derail a GC 
navigating the PMC allocation list in another thread (since we're avoiding 
the use of lock contention).

Secondly, PMCs are allocated by attaching themselves to the head of the 
allocation chain.  This could require a global lock on this one tiny part.  
While relatively fast, it might be better for each interpreter to maintain a 
separate chain (and thus head pointer), as sketched below.  This requires 
that the GC navigate each chain separately, which adds nominal overhead.  
Note that sharing of PMCs between interpreters doesn't affect this, since 
only the PMC creator and the GC need to worry about the allocation chains.
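
A per-interpreter variant of the earlier sketch; pmc_from_arena() is a 
hypothetical arena allocator, and this Interp layout is invented for the 
example:

typedef struct Interp {
    PMC *alloc_chain_head;   /* this interpreter's chain of handles */
    /* ... other interpreter state ... */
} Interp;

extern PMC *pmc_from_arena(void);   /* hypothetical */

/* Only the owning interpreter ever pushes onto its own chain, so no
   lock is needed between interpreters. */
PMC *pmc_new(Interp *interp) {
    PMC *p = pmc_from_arena();
    p->alloc_prev = NULL;
    p->alloc_next = interp->alloc_chain_head;
    if (interp->alloc_chain_head)
        interp->alloc_chain_head->alloc_prev = p;
    interp->alloc_chain_head = p;
    return p;
}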

Thirdly, since partial ref-counting doesn't free the PMC handle, it's 
possible that algorithms that allocate and free millions of PMCs will still 
grow the OS heap enormously.  I really want to avoid this phenomenon, but it 
would involve locking the allocation chain.  Perhaps, taking a cue from vmem, 
regional locks can be performed.  The GC and the worker thread compete for 
these tiny regions, which have nominal lock times; spin-locks should suffice. 
Namely, split the allocation chain into "magazines" of m items, then produce 
a linked list of these magazines; locks occur on the magazines.  This adds 
little or no overhead to the worker thread, but does add some overhead to the 
GC (performing lots of [un]locking).
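
Something like the following, where spinlock_t stands in for whatever 
spin-lock primitive we end up with, and the sizes are just placeholders:

typedef volatile int spinlock_t;    /* stand-in for a real spin-lock */

#define MAGAZINE_SIZE 64            /* "m" items per magazine; tunable */

typedef struct Magazine {
    spinlock_t       lock;                  /* guards only this magazine */
    int              used;                  /* live handles in slots[] */
    PMC             *slots[MAGAZINE_SIZE];  /* handles owned here */
    struct Magazine *next;                  /* next magazine in the chain */
} Magazine;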

Fourthly, to allow new PMC allocations to occur while the GC navigates the 
list, it will have to temporarily detach all the lists from their head 
pointers (unless regional locking is used).  When the GC is done, it can 
reattach the compacted lists to their respective heads.  If regional locking 
isn't used, a brief lock around each head pointer would suffice.

Fifthly, it would really be nice to separate portions of the GC heap between 
threads.  One thing I learned from the vmem architecture is the danger of 
cache-line sharing.  If you carve out an 8-byte region from the heap for one 
thread, then the next 8 bytes for another thread (both running on separate 
CPUs), then they're going to share a cache line and thus cripple performance; 
I've witnessed this even on my dual-proc Celeron.  While it's acceptable to 
let "some" objects be shared, a raw stack-carver would guarantee the demise 
of performance on multi-CPU architectures in an MT environment.  One way to 
avoid this is to use separate heaps for each thread; this all but eliminates 
locking contention while solving the cache-sharing problem.  One obvious 
downside is that you require more initial memory.  By producing linked lists 
of heap chunks (of size 128K) instead of having monolithically growing heaps, 
the compacting can occur in stages, and thus dramatically minimize the unused 
portions of memory (by never having more than a single 128K heap chunk that's 
empty).  Though I'm sure these details are addressed by existing 
state-of-the-art GCs.
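
For concreteness, here's the kind of per-thread chunked bump allocator I'm 
picturing (names invented for the example):

#include <stdlib.h>

#define CHUNK_SIZE (128 * 1024)   /* 128K heap chunks */

typedef struct HeapChunk {
    struct HeapChunk *next;       /* next chunk owned by this thread */
    size_t            top;        /* bump-pointer offset into data[] */
    char              data[CHUNK_SIZE];
} HeapChunk;

/* Bump-allocate n bytes from the thread's own chunk list, grabbing a
   fresh chunk when the current one is exhausted.  Since each thread
   allocates only from its own chunks, no two threads ever interleave
   small objects on the same cache line. */
void *heap_alloc(HeapChunk **chunkp, size_t n) {
    HeapChunk *c = *chunkp;
    if (c == NULL || c->top + n > CHUNK_SIZE) {
        HeapChunk *fresh = malloc(sizeof *fresh);
        if (fresh == NULL)
            return NULL;
        fresh->next = c;
        fresh->top  = 0;
        *chunkp = c = fresh;
    }
    void *p = c->data + c->top;
    c->top += n;
    return p;
}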

Lastly, with the increase in complexity, it would be interesting to weigh the 
pros and cons of a purely reference-counted PMC/string.  The non-PMC object 
data would still take advantage of the periodic copying GC, but the 
object-handle management would be completely maintained by the "set_p_xxx" 
and stack op-codes; each entry in a register counts as a reference.  While 
this is slightly more run-time maintenance, we avoid periodic O(2n) 
linked-list traversals (for millions of PMCs).  Since not all PMCs or strings 
even refer to things in the GC heap, this is an effective compromise, just as 
a point of comparison.  True, we still have the potential for leaks, but that 
is avoidable even in perl5.6.
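
To make that concrete, a register op under pure refcounting might look like 
this.  pmc_incref()/pmc_decref() are hypothetical, as is the assumption that 
the Interp struct carries a pmc_reg[] register file:

extern void pmc_incref(PMC *p);
extern void pmc_decref(PMC *p);   /* frees immediately at count zero */

/* A set_p-style op: storing a PMC into a register takes a reference,
   and the displaced value drops one (possibly freeing it on the spot). */
void set_p_reg(Interp *interp, int reg, PMC *val) {
    PMC *old = interp->pmc_reg[reg];
    if (val != NULL)
        pmc_incref(val);
    interp->pmc_reg[reg] = val;
    if (old != NULL)
        pmc_decref(old);
}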

In summary, the lingering issues are:
a) PMC memory management is complex and non-ideal whether PMCs are 
relocatable or not, and whether they're GC'd or ref-counted (meaning 
someone's going to spend CPU time somewhere).  GC gives the core interpreter 
fewer things to worry about, but necessitates more asynchronous work-loads.

b) "Foreign access" management seems problematic.

-Michael




