Sometimes I find that the best way to understand a technical work is to write my way through it. Hopefully you'll forgive me for using all of you shamelessly to review my comprehension. :-) If I get this right, all of the "new" stuff will appear at the end. I have read through the baseline immix and RC-immix papers. What I'm going to try to do below is (1) describe how RC-immix works, (2) identify questions I have about it, and (3) describe further optimizations I can see.
I'd appreciate feedback on all three parts: anything in RC-immix that I may have misread or misunderstood, any other issues you may see (or explanations of why my issues are non-issues), and discussion of further opportunities.

The immix collector, by the way, supports multiple mutator threads and multiple collector threads, but it does not support simultaneous execution of collector threads and mutator threads. It is not clear to me (yet) from the RC-immix paper whether this also holds for RC-immix. The immix work did not evaluate execution on larger-scale multicore machines.

*Lines and Blocks*

The RC-immix arena follows the basic model of immix with a few changes. The arena is organized as blocks that are in turn organized into lines. The RC-immix paper does not state the block size, so I assume they stuck with the 32KB block size used by immix. Blocks are classified as free (all lines are empty), recycled (some lines still allocated), or full. RC-immix uses a 256B line size, in contrast to the immix 128B size. In both systems, there is some metadata stored at the front of the block, so the number of lines in a block is somewhat less than (block size / line size).

*Note*: I suspect that recyclable blocks which have only a few free lines are treated as full, but I haven't looked at the code. There isn't much point in filling a single line in an otherwise large block unless you are trying to evacuate a recyclable block in order to make it free. Which, come to that, they may be doing.

Conceptually, there are two block-level allocators: an allocator for recycled blocks and an allocator for free blocks (referred to in the papers as the "overflow allocator"). Each has a corresponding block pool. Blocks are placed in the recycle pool when (a) they are not currently owned by a thread-level allocator, and (b) they have some threshold number of free lines satisfying a suitable metric of fragmentation.
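To make the geometry concrete, here is a minimal sketch of the block/line arithmetic and classification, assuming the 32KB block and 256B line sizes above. The metadata reserve and the recycle threshold are my own guesses, not values from the papers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: immix's 32KB blocks, RC-immix's 256B lines.     */
/* META_LINES is hypothetical: lines reserved for per-block metadata, */
/* so usable lines < (block size / line size) as noted above.         */
#define BLOCK_SIZE   (32 * 1024)
#define LINE_SIZE    256
#define META_LINES   2
#define USABLE_LINES (BLOCK_SIZE / LINE_SIZE - META_LINES)

enum block_class { BLOCK_FREE, BLOCK_RECYCLED, BLOCK_FULL };

/* Hypothetical threshold: a block with fewer free lines than this */
/* is treated as full, per the note above.                         */
#define RECYCLE_THRESHOLD 4

/* Classify a block from its per-line object counts (a line with */
/* count zero is free).                                          */
static enum block_class classify(const uint8_t line_counts[USABLE_LINES]) {
    size_t free_lines = 0;
    for (size_t i = 0; i < USABLE_LINES; i++)
        if (line_counts[i] == 0)
            free_lines++;
    if (free_lines == USABLE_LINES) return BLOCK_FREE;
    if (free_lines >= RECYCLE_THRESHOLD) return BLOCK_RECYCLED;
    return BLOCK_FULL;
}
```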
Blocks are placed in the free pool when (a) they are not currently owned by a thread-level allocator, and (b) all lines in the block are free. Both the recycled block allocator and the free block allocator are global allocators requiring global synchronization.

*Note*: contention at the block allocation layer is fairly easy to deal with by introducing a hierarchy of free lists corresponding to the CPU topology.

For each line, RC-immix maintains a count of the number of objects in that line. The paper is unclear about whether this count tracks (a) how many objects *start* in that line, or (b) how many objects have some portion of their payload in that line.

For each object, RC-immix maintains a single-byte header:

  RC[3]  A sticky reference count
  S      Indicates that the object straddles a line boundary
  M      Mark bit used in mark/sweep
  N      Indicates that the object is newly allocated
  FF     Indicates the forwarding state of the object

The FF bits are alternatively used for object logging in support of deferred reference counting. This makes it impossible to run a collector and a mutator simultaneously. I believe that the S bit can and should be migrated to the per-line metadata, in which case that bit could be reclaimed to record logging status.

- *Question*: it is not clear to me whether logging applies only to young objects or possibly also to old objects.
- *Option*: It would be simple to arrange that the S bit is not necessary for a forwarded object.
- *Option*: I am not convinced that the N bit is relevant for a forwarded object.
- *Option*: Note that new objects never need to be logged. The N bit can be seen as an optimization of the logging scheme in which all "old" reference fields are known to be null. In consequence, the only objects that end up in the log should be old objects.
- *Opinion*: Two F bits are not needed. A single F bit can do the job.
- *Opinion*: If the S bit is moved to the per-line metadata, then only the M bit and one F bit are actually needed for a forwarded object. If these are moved to the least-significant positions, a forwarding pointer can be written in the header word without otherwise disturbing the object. Given 8-byte alignment rules, this seems feasible even if the S bit cannot be moved.

*New Object Allocation*

New objects are allocated from a nursery with an initially zero reference count. They are allocated with the "N" (new) bit set in their header, and are not initially reference counted. During nursery collection, the N bit for live objects will be turned off, and the reference count will be updated based on the traced in-bound references. The paper is explicit that it is *not* performing a compacting copy of the nursery, but it also mentions coalescing objects in new space. This leaves me with some questions:

- *Question*: Does nursery allocation always proceed from an initially free block?
- *Question*: Why isn't partial compaction being used here? I.e., a scheme in which up to *N* lines will be moved to the front of the nursery block, with the effect that the oldest lines trend toward the bottom and can be generationally eliminated. Or is this really what is meant by "compaction"?
- *Question*: Doesn't the capacity for coalescing imply the need for a complete stack map?
- *Question*: After nursery GC, the surviving block is clearly a recycled block. At what point is the residual free space in this block considered low enough that it can no longer serve effectively as a nursery block? What happens then?

I am assuming that the allocation strategy in the nursery follows the general pattern of object allocation elsewhere, which I'm describing next. If that is the case, then the scheme needs to keep track of which blocks may contain content from the current nursery generation.

*General Allocation*

Every thread operates a thread-local allocator (TLA).
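Before going further: for concreteness, here is how I picture the single-byte header described earlier (RC[3], S, M, N, FF). The bit positions are my own assumption; the paper fixes only the field widths. The "stuck at 7" behavior of the three-bit count comes up again under *Counting* below:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignment for the one-byte RC-immix header.  */
/* RC occupies the low three bits; a count of 7 is "stuck" and is */
/* left for the tracing collector to reclaim.                     */
#define HDR_RC_MASK  0x07u  /* RC[3]: sticky reference count      */
#define HDR_S        0x08u  /* object straddles a line boundary   */
#define HDR_M        0x10u  /* mark bit used in mark/sweep        */
#define HDR_N        0x20u  /* newly allocated                    */
#define HDR_F_MASK   0xC0u  /* FF: forwarding state / log state   */

#define RC_STUCK     7u

/* Saturating increment: once the count reaches 7 it sticks. */
static uint8_t rc_inc(uint8_t hdr) {
    uint8_t rc = hdr & HDR_RC_MASK;
    if (rc < RC_STUCK) rc++;
    return (uint8_t)((hdr & ~HDR_RC_MASK) | rc);
}
```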
Where the block allocator allocates blocks, the TLA allocates lines within blocks. At any given time, a TLA may "own" two blocks. The first was obtained from the recycled block allocator. It has ranges of contiguous free lines interspersed with ranges of contiguous allocated lines. Allocation in this block proceeds in increasing address order, allocating each range of free lines in turn using a bump allocator. The second block held by the TLA was initially obtained from the free block allocator and is filled sequentially using a bump allocator. Objects are allocated from this block if and only if they cannot be allocated from the TLA's recycled block.

Large objects are allocated from the Large Object Store. The size threshold for large objects is not documented in the papers, and the operation of the large object store is not discussed.

Small objects (defined as objects that fit within one line) may not span lines. Medium objects (defined as larger than one line) *may* span lines. If a medium object cannot be allocated within the current free-line subrange of the TLA's recycled block, it is allocated instead from the TLA's free block. This overflow management strategy ensures that the TLA does not rapidly advance through the recycled block, leaving a lot of holes that could have been used for smaller objects.

*Note*: Object allocation sizes are statically known. This prompts me to wonder why the TLA does not use a "small recycled block" / "medium recycled block" / "overflow block" strategy. Does carrying around more bump allocator base pointers introduce more register pressure than we want to spend on allocation?

*Counting*

New objects are allocated within the nursery, with their N bit set. If the N bit is set, reference counting is not performed. All "old" reference values within the object are known to be null, and any "current" reference values will have their target objects' reference counts incremented if the new object survives to the next collection.
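That promotion step might look like the following sketch. The toy object layout, field count, and names are mine; only the clear-N-and-increment-referents behavior comes from the papers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_RC_MASK 0x07u   /* assumed low-bits RC field */
#define HDR_N       0x20u   /* assumed "new" bit         */
#define RC_STUCK    7u

/* Toy object: one header byte plus a fixed set of reference */
/* fields (hypothetical layout, for illustration only).      */
#define MAX_REFS 4
struct obj {
    uint8_t     hdr;
    struct obj *refs[MAX_REFS];
};

/* Nursery promotion as described above: for a surviving new   */
/* object, increment the sticky count of every referent (it    */
/* saturates at 7), then clear N so the object is henceforth   */
/* reference counted normally.                                 */
static void promote(struct obj *o) {
    if (!(o->hdr & HDR_N))
        return;                      /* only new objects need this */
    for (size_t i = 0; i < MAX_REFS; i++) {
        struct obj *t = o->refs[i];
        if (t && (t->hdr & HDR_RC_MASK) < RC_STUCK)
            t->hdr++;                /* RC is the low bits; no carry out */
    }
    o->hdr &= (uint8_t)~HDR_N;
}
```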
The N bit is cleared as these counts are incremented. Note that an object with its N bit set resides in thread-local storage; no cross-CPU contention exists for such objects.

Old objects are logged before mutation so that their reference count updates can be deferred. The logged copy records the pre-modification state of the object. The in-place copy maintains the current state of the object. When an object is logged, a bit is set to indicate that logging has occurred. This bit serves, in effect, as a "dirty" bit for the object. Eventually, reference counts for such an object will be reconciled. According to the benchmarks, the effectiveness of this mechanism is variable.

- *Issue*: It is not clear to me why two bits are needed to mark the fact that an object is logged. It seems to me that any mutator may make a thread-local snapshot, update the M bit using a bit test-and-set, and then log the old version if and only if the M bit was not set. This ought to be acceptable, since updates to the M bits should be comparatively rare. If this approach is adopted, then the remaining concern is contention between collection threads and mutator threads with regard to *clearing* the M bit. But that is a more general form of contention that needs to be addressed for marking as a whole in any case.
- *Issue*: A racing M scheme doesn't interact favorably with a non-logging scheme, because updates to single references might occur while the logged copy is being made. This might explain why two bits are being used for logging state.

The per-object count field has limited magnitude. In the RC-immix design, three bits are available, and a count value of 7 indicates a "stuck" count. Live objects having stuck counts are freed by a background tracing collector.

*Questions About Logging*

There is a question in my mind about whether the object log should be local or global.
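As an aside, the single-bit test-and-set logging I argued for under the first *Issue* above could be sketched with C11 atomics. The names and bit position are mine; the point is only that fetch-or lets exactly one mutator win the race to take the snapshot:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define HDR_LOG 0x10u   /* hypothetical "logged"/dirty bit */

/* Log-before-mutate with a single bit: atomic_fetch_or returns   */
/* the prior header, so the mutator that observes the bit clear   */
/* is the unique winner and must record the pre-modification      */
/* snapshot; all others see the bit already set and skip logging. */
static bool try_log(_Atomic uint8_t *hdr) {
    uint8_t old = atomic_fetch_or(hdr, HDR_LOG);
    return (old & HDR_LOG) == 0;  /* true => caller takes the snapshot */
}
```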
Logging only occurs on objects in the general heap, so if an object is global, it would make some amount of sense to log it in a global log. If a mechanism were added to mark "hot" objects somehow, that mechanism could be used to further defer reference counting on hot objects. Alternatively, if objects are lock-before-mutate, we could update reference counts on unlock. In that case, logging is best done to a per-thread log. The net effect is somewhat like an RCU mechanism.

*Concurrency*

RC-immix allows mutators to run concurrently or collectors to run concurrently, but mutators may not run concurrently with collectors. I am unclear why this is so. Mutation in the long-term heap is rare, and it would not be difficult to ensure that updates to the appropriate header bits were atomic. The N and M bits could readily be reframed for tri-color marking, with the unused (fourth) representable value meaning "new". There are issues with forwarding barriers that I have raised in a separate email; these do not seem insurmountable.

If I were attempting to render this scheme fully concurrent, I believe I would attempt the following changes to the object header bits:

1. Coalesce the M and N bits into a two-bit field whose values are white, grey, black, and new.
2. Shift the S bit to the per-line metadata, allowing expansion of the reference count field.
3. Reduce the forwarding bits to a single bit, using the pointer overwrite method proposed by Nettles and O'Toole, but using the least three (known zero) bits of the forwarding pointer to record the forwarded state (F) and the mark color (if that is needed on a forwarded object - not sure). Though if object sizes permit, it would sure be nice to keep the reference count information on the forwarding stub object, so that we know when all in-bound references have been discovered.
4. Use a single bit to record object logging.
5. Use atomic updates on the object header when necessary.
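Item 3's pointer-overwrite trick can be sketched as follows. With 8-byte alignment, the low three bits of a forwarding pointer are known zero; the particular assignment of those bits here is my own assumption:

```c
#include <assert.h>
#include <stdint.h>

/* With 8-byte object alignment, a forwarding pointer's low three   */
/* bits are zero, so they can carry the forwarded flag (F) and a    */
/* two-bit mark color, as item 3 proposes. Bit meanings are         */
/* hypothetical.                                                    */
#define FWD_BIT    0x1u
#define COLOR_MASK 0x6u   /* two bits of mark color */

static uintptr_t fwd_encode(void *to, unsigned color) {
    uintptr_t p = (uintptr_t)to;
    assert((p & 0x7u) == 0);           /* alignment invariant */
    return p | FWD_BIT | ((uintptr_t)(color << 1) & COLOR_MASK);
}

static int fwd_is_forwarded(uintptr_t w) { return (int)(w & FWD_BIT); }
static void *fwd_target(uintptr_t w)     { return (void *)(w & ~(uintptr_t)0x7u); }
static unsigned fwd_color(uintptr_t w)   { return (unsigned)((w & COLOR_MASK) >> 1); }
```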
*Other Comments*

I'm not entirely convinced about the particular combination of mechanisms being selected here, but there are a bunch of good ideas that are being cleverly exploited together.

The RC-immix paper does not report pause times, which is unfortunate. The URC paper reports typical pause times of 70ms for some benchmarks when using a 4 MByte nursery. This is uncomfortably high. For calibration, modern game frame generation latencies are one frame every 16ms, trending toward 8ms.

OK. Off to read and internalize the URC paper.

shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
