On Sunday 04 November 2001 02:39 pm, Dan Sugalski wrote:
> At 08:32 PM 11/4/2001 +0100, Benoit Cerrina wrote:
> > > There will be a mechanism to register PMCs with the interpreter to note
> > > they're pointed to by something that the interpreter can't reach. (For
> > > example, a structure in your extension code, or via a pointer stashed
> > > in the depths of a buffer object, or referenced by another interpreter)
> > > This "foreign access" registry is considered part of an interpreter's
> > > root set.
> >
> > If this is the case, how do you want to move the PMC? I thought you wanted
> > a copying collector.
>
> While the PMC structures themselves don't move (no real need--they're of
> fixed size so you can't fragment your allocation pool, though it makes
> generational collection easier to some extent) the data pointed to by the
> PMC can. That's the bit that moves.
Ok, so far, here's what I see: there are two main memory segments.

First, a traditional alloc/free region. Within this area are arenas (for fixed-size memory objects). This region must be efficient under MT, and hopefully not too wasteful or prone to fragmentation. It is mostly for core operation, non-arena allocations are to be minimized, and memory leaks are critical as usual. The utilization patterns should be analyzable, well known, and compensated for (with respect to fragmentation, thread contention, etc.). My vmem derivative might still be valuable here, but I suppose we need to let the core's needs and characteristics flesh out further.

Then there's a GC region, which is primarily the allocation space of the interpreted app, and which obviously can't be trusted with respect to memory leaks or usage (fragmentation) patterns. PMCs and strings are handles into this GC region, though the handles themselves are stored in core memory (above). Question: are there other types of references? I can't think of any.

The GC algorithm (whether a separate thread or called incrementally when memory is low, but not yet starved) needs to quickly access this GC heap's values, which I believe is Dan's reason for requiring a maximum of two levels of indirection. I suppose it just runs down the known PMC lists, including the PMC and string register sets, the stacks, and the stashes for each interpreter.

You do, however, refer to a "foreign access" region, and my first impression is that it's the wrong way of going about it. First of all, how are arrays of arrays of arrays handled?

    // create a buffer object of PMCs
    PMC_t p1 = new Array(50)
    for i in 0 .. 49:
        p1->data[i] = new Array(50)

In order to comply with the max depth, the array creation will have to register each sub-array entry in the foreign access region, or am I missing something? This will mean that the foreign access data structure will grow VERY large when PMC arrays/hashes are prevalent.
What's worse, this data structure is stored within the core, which means an additional burden on core-memory fragmentation and contention. Additionally, what happens when an array is shared by two threads (and thus two interpreters)? Whose foreign access region is it stored in? My guess is that, to avoid premature freeing, BOTH. So now a work queue used by a 30-thread producer/consumer app is allocating and freeing LOTS of core memory on each enqueue/dispatch. Again, with the details this fuzzy, I'm probably going off half-cocked; but I did qualify that this was my initial impression.

My suggestion is to not use a foreign-references section; or if we do, to not utilize it for deeply nested data structures. Instead, incorporate a doubly linked list within PMCs and strings. Thus whenever you allocate a PMC or string, you attach it to the chain of allocated handles; whenever the PMC is freed, you detach it. The GC then has the laughably simple task of navigating this linked list, which spans all threads. This can incorporate mark-and-sweep or copying or whatever. By adding 8 or 16 bytes to the size of a PMC/string, you avoid many memory-related problems, not to mention the concern about depth away from the interpreter root.

Beyond this, I think I see some problems with not having PMCs relocatable. While compacting the object stores that are readily resized can be very valuable, the only type of memory leak this avoids is fragmentation-related. The PMCs themselves still need to be tested against memory leaks. Now, I'm still in favor of some form of reference counting; I think that in the most common case only one data structure will reference a PMC, and thus when it goes away it should immediately cause the deallocation of the associated object space (sacrificing a pittance of run-time CPU so that the GC and free memory are relaxed).
But I hear that we're not relying on an integer for reference counting (as with perl5), and instead are mostly dependent on the GC. Well, if we use a copying GC but never move the PMC, then how are we freeing these PMCs? It would be possible to additionally utilize mark-and-sweep, but then we'd have three separate algorithms: simplified reference-counting deallocation (when a multi-referenced flag is not set), object-store compaction (to minimize the variable-sized heap stored in the copying GC's heap), and mark-and-sweep when we're really memory-starved and don't trust that objects have been freed.

If at least the last two parts of the algorithm are used, then the GC would periodically navigate the list. On the first pass, it would find stale handles (those freed by simplified reference counting) and detach them from the list. Note that if a non-GC thread detached the PMC from the chain, it would derail the GC; global locking would be required, which is a Bad Thing (tm). The first pass also would reset the mark-and-sweep flags.

The second pass performs the marking. I believe this one requires navigation of all the interpreter data structures (stacks, register sets, etc.). In fact, it seems to me that within the allocation chain, only references need be utilized in the mark stage. Perhaps a second linked list could be provided for such references: an additional 8/16 bytes within each PMC (or within the reference-specific data structure). This would cut down on the mark stage's overhead.

Lastly, the PMC allocation chain is navigated. Those PMCs/strings not "marked" are freed; those that are marked are "copied" into the mirror heap so as to consolidate free memory.
Even with this, I see some issues.

Firstly, allocating a new PMC requires special care so as not to derail a GC that is navigating the PMC allocation list in another thread (since we're avoiding the use of lock contention).

Secondly, PMCs are allocated by attaching themselves to the head of the allocation chain. This could require a global lock on this one tiny part. While relatively fast, it might be better for each interpreter to maintain a separate chain (and thus head pointer). This requires that the GC navigate each chain separately, which adds nominal overhead. Note that sharing of PMCs between interpreters doesn't affect this, since only the PMC creator and the GC need to worry about the allocation chains.

Thirdly, since partial ref-counting doesn't free the PMC handle, it's possible that algorithms that allocate and free millions of PMCs will still grow the OS heap enormously. I really want to avoid this phenomenon, but it would involve locking the allocation chain. Perhaps, taking a cue from vmem, regional locks can be performed. The GC and the worker thread compete for these tiny regions, which have nominal lock times; spin-locks should suffice. Namely, split the allocation chain into "magazines" of m items, then produce a linked list of these magazines. Locks occur on the magazines. This adds little or no overhead to the worker thread, but does add some overhead to the GC (performing lots of [un]locking).

Fourthly, to allow new PMC allocations to occur while the GC navigates the list, it will have to temporarily detach all the lists from their head pointers (unless regional locking is used). When the GC is done, it can reattach the compacted lists to their respective heads. If regional locking isn't used, a simple lock around the head pointer can be applied very quickly.

Fifthly, it would really be nice to separate portions of the GC heap between threads. One thing I learned from the vmem architecture is the danger of cache sharing.
If you carve out an 8-byte region from the heap for one thread, then the next 8 bytes for another thread (both running on separate CPUs), then they're going to share a cache line, and thus cripple performance. I've witnessed this even on my dual-proc Celeron. While it's acceptable to let "some" objects be shared, a raw stack-carver would guarantee the demise of performance on multi-CPU architectures in an MT environment. One way to avoid this is to use separate heaps for each thread. This all but eliminates locking contention, while solving the cache-sharing problem. One obvious downside is that you require more initial memory. By producing linked lists of heap chunks (of size 128K) instead of having monolithically growing heaps, the compacting can occur in stages, and thus dramatically minimize the unused portions of memory (by never having more than a single 128K heap chunk that's empty). Though I'm sure these details are addressed by existing state-of-the-art GCs.

Lastly, with the increase in complexity, it would be interesting to weigh the pros and cons of a purely reference-counted PMC/string. The non-PMC object data would take advantage of the periodic copying GC, but the object-handle management would be completely maintained by the "set_p_xxx" and stack op-codes. Each entry in a register counts as a reference. While this is slightly more run-time maintenance, we avoid periodic O(2n) linked-list lookups (for millions of PMCs). Since not all PMCs or strings even refer to things in the GC heap, this is an effective compromise, just as a point of comparison. True, we still have the potential for leaks, but that is avoidable even in perl5.6.

In summary, the lingering issues are:

a) PMC memory management is complex and non-ideal whether PMCs are relocatable or not, and whether they're GC'd or ref-counted (meaning someone's going to spend CPU time somewhere). GC provides fewer worrying points for the core interpreter, but necessitates more asynchronous work-loads.
b) "Foreign access" management seems problematic.

-Michael