Sometimes I find that the best way to understand a technical work is to write my way through it. Hopefully you'll forgive me for using all of you shamelessly to review my comprehension. :-) If I get this right, all of the "new" stuff will appear at the end. I have read through the baseline immix and RC-immix papers. What I'm going to try to do below is (1) describe how RC-immix works, (2) identify questions I have about it, and (3) describe further optimizations I can see.
I'd appreciate feedback on all three parts: anything in RC-immix that I may have misread or misunderstood, any other issues you may see (or explanations of why my issues are non-issues), and discussion of further opportunities.

The immix collector, by the way, supports multiple mutator threads and multiple collector threads, but it does not support simultaneous execution of collector threads and mutator threads. It is not clear to me (yet) from the RC-immix paper whether this also holds for RC-immix. The immix work did not evaluate execution on larger-scale multicore machines.

*Lines and Blocks*

The RC-immix arena follows the basic model of immix with a few changes. The arena is organized as blocks that are in turn organized into lines. The RC-immix paper does not state the block size, so I assume they stuck with the 32KB block size used by immix. Blocks are classified as free (all lines are empty), recycled (some lines still allocated), or full. RC-immix uses a 256B line size, in contrast to the immix 128B size. In both systems, there is some metadata stored at the front of the block, so the number of lines in a block is somewhat less than (block size / line size).

*Note*: I suspect that recyclable blocks which have only a few free lines are treated as full, but I haven't looked at the code. There isn't much point in filling a single line in an otherwise large block unless you are trying to evacuate a recyclable block in order to make it free. Which, come to that, they may be doing.

Conceptually, there are two block-level allocators: an allocator for recycled blocks and an allocator for free blocks (referred to in the papers as the "overflow allocator"). Each has a corresponding block pool. Blocks are placed in the recycle pool when (a) they are not currently owned by a thread-level allocator, and (b) they have some threshold number of free lines satisfying a suitable metric of fragmentation.
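To make the geometry concrete, here is a minimal sketch of the block/line arithmetic and classification, assuming the 32KB block and 256B line sizes above. The metadata reserve and the recycle threshold are my own guesses, not values from the papers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: immix's 32KB blocks, RC-immix's 256B lines.     */
/* META_LINES is hypothetical: lines reserved for per-block metadata, */
/* so usable lines < (block size / line size) as noted above.         */
#define BLOCK_SIZE   (32 * 1024)
#define LINE_SIZE    256
#define META_LINES   2
#define USABLE_LINES (BLOCK_SIZE / LINE_SIZE - META_LINES)

enum block_class { BLOCK_FREE, BLOCK_RECYCLED, BLOCK_FULL };

/* Hypothetical threshold: a block with fewer free lines than this */
/* is treated as full, per the note above.                         */
#define RECYCLE_THRESHOLD 4

/* Classify a block from its per-line object counts (a line with */
/* count zero is free).                                          */
static enum block_class classify(const uint8_t line_counts[USABLE_LINES]) {
    size_t free_lines = 0;
    for (size_t i = 0; i < USABLE_LINES; i++)
        if (line_counts[i] == 0)
            free_lines++;
    if (free_lines == USABLE_LINES) return BLOCK_FREE;
    if (free_lines >= RECYCLE_THRESHOLD) return BLOCK_RECYCLED;
    return BLOCK_FULL;
}
```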
Blocks are placed in the free pool when (a) they are not currently owned by a thread-level allocator, and (b) all lines in the block are free. Both the recycled block allocator and the free block allocator are global allocators requiring global synchronization.

*Note*: contention at the block allocation layer is fairly easy to deal with by introducing a hierarchy of free lists corresponding to the CPU topology.

For each line, RC-immix maintains a count of the number of objects in that line. The paper is unclear about whether this count tracks (a) how many objects *start* in that line, or (b) how many objects have some portion of their payload in that line.

For each object, RC-immix maintains a single-byte header:

  RC[3]  A sticky reference count
  S      Indicates that the object straddles a line boundary
  M      Mark bit used in mark/sweep
  N      Indicates that the object is newly allocated
  FF     Indicates the forwarding state of the object

The FF bits are alternatively used for object logging in support of deferred reference counting. This makes it impossible to run a collector and a mutator simultaneously. I believe that the S bit can and should be migrated to the per-line metadata, in which case that bit could be reclaimed to record logging status.

- *Question*: it is not clear to me whether logging applies only to young objects or possibly also to old objects.
- *Option*: It would be simple to arrange that the S bit is not necessary for a forwarded object.
- *Option*: I am not convinced that the N bit is relevant for a forwarded object.
- *Option*: Note that new objects never need to be logged. The N bit can be seen as an optimization of the logging scheme in which all "old" reference fields are known to be null. In consequence, the only objects that end up in the log should be old objects.
- *Opinion*: Two F bits are not needed. A single F bit can do the job.
- *Opinion*: If the S bit is moved to the per-line metadata, then only the M bit and one F bit are actually needed for a forwarded object. If these are moved to the least-significant positions, a forwarding pointer can be written in the header word without otherwise disturbing the object. Given 8-byte alignment rules, this seems feasible even if the S bit cannot be moved.

*New Object Allocation*

New objects are allocated from a nursery with an initially zero reference count. They are allocated with the "N" (new) bit set in their header, and are not initially reference counted. During nursery collection, the N bit for live objects will be turned off, and the reference count will be updated based on the traced in-bound references. The paper is explicit that it is *not* performing a compacting copy of the nursery, but it also mentions coalescing objects in new space. This leaves me with some questions:

- *Question*: Does nursery allocation always proceed from an initially free block?
- *Question*: Why isn't partial compaction being used here? I.e., a scheme in which up to *N* lines will be moved to the front of the nursery block, with the effect that the oldest lines trend toward the bottom and can be generationally eliminated. Or is this really what is meant by "compaction"?
- *Question*: Doesn't the capacity for coalescing imply the need for a complete stack map?
- *Question*: After nursery GC, the surviving block is clearly a recycled block. At what point is the residual free space in this block considered low enough that it can no longer serve effectively as a nursery block? What happens then?

I am assuming that the allocation strategy in the nursery follows the general pattern of object allocation elsewhere, which I'm describing next. If that is the case, then the scheme needs to keep track of which blocks may contain content from the current nursery generation.

*General Allocation*

Every thread operates a thread-local allocator (TLA).
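Before going further: for concreteness, here is how I picture the single-byte header described earlier (RC[3], S, M, N, FF). The bit positions are my own assumption; the paper fixes only the field widths. The "stuck at 7" behavior of the three-bit count comes up again under *Counting* below:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical bit assignment for the one-byte RC-immix header.  */
/* RC occupies the low three bits; a count of 7 is "stuck" and is */
/* left for the tracing collector to reclaim.                     */
#define HDR_RC_MASK  0x07u  /* RC[3]: sticky reference count      */
#define HDR_S        0x08u  /* object straddles a line boundary   */
#define HDR_M        0x10u  /* mark bit used in mark/sweep        */
#define HDR_N        0x20u  /* newly allocated                    */
#define HDR_F_MASK   0xC0u  /* FF: forwarding state / log state   */

#define RC_STUCK     7u

/* Saturating increment: once the count reaches 7 it sticks. */
static uint8_t rc_inc(uint8_t hdr) {
    uint8_t rc = hdr & HDR_RC_MASK;
    if (rc < RC_STUCK) rc++;
    return (uint8_t)((hdr & ~HDR_RC_MASK) | rc);
}
```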
Where the block allocator allocates blocks, the TLA allocates lines within blocks. At any given time, a TLA may "own" two blocks. The first was obtained from the recycled block allocator. It has ranges of contiguous free lines interspersed with ranges of contiguous allocated lines. Allocation in this block proceeds in increasing address order, allocating each range of free lines in turn using a bump allocator. The second block held by the TLA was initially obtained from the free block allocator and is filled sequentially using a bump allocator. Objects are allocated from this block if and only if they cannot be allocated from the TLA's recycled block.

Large objects are allocated from the Large Object Store. The size threshold for large objects is not documented in the papers, and the operation of the large object store is not discussed.

Small objects (defined as objects that fit within one line) may not span lines. Medium objects (defined as larger than one line) *may* span lines. If a medium object cannot be allocated within the current free-line subrange of the TLA's recycled block, it is allocated instead from the TLA's free block. This overflow management strategy ensures that the TLA does not rapidly advance through the recycled block, leaving a lot of holes that could have been used for smaller objects.

*Note*: Object allocation sizes are statically known. This prompts me to wonder why the TLA does not use a "small recycled block" / "medium recycled block" / "overflow block" strategy. Does carrying around more bump allocator base pointers introduce more register pressure than we want to spend on allocation?

*Counting*

New objects are allocated within the nursery, with their N bit set. If the N bit is set, reference counting is not performed. All "old" reference values within the object are known to be null, and any "current" reference values will have their target objects' reference counts incremented if the new object survives to the next collection.
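That promotion step might look like the following sketch. The toy object layout, field count, and names are mine; only the clear-N-and-increment-referents behavior comes from the papers:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define HDR_RC_MASK 0x07u   /* assumed low-bits RC field */
#define HDR_N       0x20u   /* assumed "new" bit         */
#define RC_STUCK    7u

/* Toy object: one header byte plus a fixed set of reference */
/* fields (hypothetical layout, for illustration only).      */
#define MAX_REFS 4
struct obj {
    uint8_t     hdr;
    struct obj *refs[MAX_REFS];
};

/* Nursery promotion as described above: for a surviving new   */
/* object, increment the sticky count of every referent (it    */
/* saturates at 7), then clear N so the object is henceforth   */
/* reference counted normally.                                 */
static void promote(struct obj *o) {
    if (!(o->hdr & HDR_N))
        return;                      /* only new objects need this */
    for (size_t i = 0; i < MAX_REFS; i++) {
        struct obj *t = o->refs[i];
        if (t && (t->hdr & HDR_RC_MASK) < RC_STUCK)
            t->hdr++;                /* RC is the low bits; no carry out */
    }
    o->hdr &= (uint8_t)~HDR_N;
}
```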
The N bit is cleared as these counts are incremented. Note that an object with its N bit set resides in thread-local storage; no cross-CPU contention exists for such objects.

Old objects are logged before mutation so that their reference count updates can be deferred. The logged copy records the pre-modification state of the object. The in-place copy maintains the current state of the object. When an object is logged, a bit is set to indicate that logging has occurred. This bit serves, in effect, as a "dirty" bit for the object. Eventually, reference counts for such an object will be reconciled. According to the benchmarks, the effectiveness of this mechanism is variable.

- *Issue*: It is not clear to me why two bits are needed to mark the fact that an object is logged. It seems to me that any mutator may make a thread-local snapshot, update the M bit using a bit test-and-set, and then log the old version if and only if the M bit was not set. This ought to be acceptable, since updates to the M bits should be comparatively rare. If this approach is adopted, then the remaining concern is contention between collection threads and mutator threads with regard to *clearing* the M bit. But that is a more general form of contention that needs to be addressed for marking as a whole in any case.
- *Issue*: A racing M scheme doesn't interact favorably with a non-logging scheme, because updates to single references might occur while the logged copy is being made. This might explain why two bits are being used for logging state.

The per-object count field has limited magnitude. In the RC-immix design, three bits are available, and a count value of 7 indicates a "stuck" count. Live objects having stuck counts are freed by a background tracing collector.

*Questions About Logging*

There is a question in my mind about whether the object log should be local or global.
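As an aside, the single-bit test-and-set logging I argued for under the first *Issue* above could be sketched with C11 atomics. The names and bit position are mine; the point is only that fetch-or lets exactly one mutator win the race to take the snapshot:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define HDR_LOG 0x10u   /* hypothetical "logged"/dirty bit */

/* Log-before-mutate with a single bit: atomic_fetch_or returns   */
/* the prior header, so the mutator that observes the bit clear   */
/* is the unique winner and must record the pre-modification      */
/* snapshot; all others see the bit already set and skip logging. */
static bool try_log(_Atomic uint8_t *hdr) {
    uint8_t old = atomic_fetch_or(hdr, HDR_LOG);
    return (old & HDR_LOG) == 0;  /* true => caller takes the snapshot */
}
```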
Logging only occurs on objects in the general heap, so if an object is global, it would make some amount of sense to log it in a global log. If a mechanism were added to mark "hot" objects somehow, that mechanism could be used to further defer reference counting on hot objects. Alternatively, if objects are lock-before-mutate, we could update reference counts on unlock. In that case, logging is best done to a per-thread log. The net effect is somewhat like an RCU mechanism.

*Concurrency*

RC-immix allows mutators to run concurrently or collectors to run concurrently, but mutators may not run concurrently with collectors. I am unclear why this is so. Mutation in the long-term heap is rare, and it would not be difficult to ensure that updates to the appropriate header bits were atomic. The N and M bits could readily be reframed for tri-color marking, with the unused (fourth) representable value meaning "new". There are issues with forwarding barriers that I have raised in a separate email; these do not seem insurmountable.

If I were attempting to render this scheme fully concurrent, I believe I would attempt the following changes to the object header bits:

1. Coalesce the M and N bits into a two-bit field whose values are white, grey, black, and new.
2. Shift the S bit to the per-line metadata, allowing expansion of the reference count field.
3. Reduce the forwarding bits to a single bit, using the pointer overwrite method proposed by Nettles and O'Toole, but using the least three (known zero) bits of the forwarding pointer to record the forwarded state (F) and the mark color (if that is needed on a forwarded object - not sure). Though if object sizes permit, it would sure be nice to keep the reference count information on the forwarding stub object, so that we know when all in-bound references have been discovered.
4. Use a single bit to record object logging.
5. Use atomic updates on the object header when necessary.
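Item 3's pointer-overwrite trick can be sketched as follows. With 8-byte alignment, the low three bits of a forwarding pointer are known zero; the particular assignment of those bits here is my own assumption:

```c
#include <assert.h>
#include <stdint.h>

/* With 8-byte object alignment, a forwarding pointer's low three   */
/* bits are zero, so they can carry the forwarded flag (F) and a    */
/* two-bit mark color, as item 3 proposes. Bit meanings are         */
/* hypothetical.                                                    */
#define FWD_BIT    0x1u
#define COLOR_MASK 0x6u   /* two bits of mark color */

static uintptr_t fwd_encode(void *to, unsigned color) {
    uintptr_t p = (uintptr_t)to;
    assert((p & 0x7u) == 0);           /* alignment invariant */
    return p | FWD_BIT | ((uintptr_t)(color << 1) & COLOR_MASK);
}

static int fwd_is_forwarded(uintptr_t w) { return (int)(w & FWD_BIT); }
static void *fwd_target(uintptr_t w)     { return (void *)(w & ~(uintptr_t)0x7u); }
static unsigned fwd_color(uintptr_t w)   { return (unsigned)((w & COLOR_MASK) >> 1); }
```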
*Other Comments*

I'm not entirely convinced about the particular combination of mechanisms being selected here, but there are a bunch of good ideas that are being cleverly exploited together.

The RC-immix paper does not report pause times, which is unfortunate. The URC paper reports typical pause times of 70ms for some benchmarks when using a 4 MByte nursery. This is uncomfortably high. For calibration, modern game frame generation latencies are one frame every 16ms, trending toward 8ms.

OK. Off to read and internalize the URC paper.

shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
