Re: Questions about GC implementations

Robin Garner Fri, 28 Oct 2005 22:58:14 -0700

In the garbage collectors I've worked with, the essential design is:


- 'new' allocates space on the heap.
- The header of each object contains a pointer (or equiv) to a per-class
data structure, that points to the GC map for the object (and virtual
dispatch tables etc etc)
- Reference fields in objects contain pointers directly to the heap
objects they reference.
- Pointer loads and stores are (optionally) performed via barriers - the
terminology is a little confusing: these are not synchronization barriers,
but opportunities for the GC to intercept the load/store and do some
additional processing.  Write barriers are common, read barriers less so.


This is also the approach I have taken, so I think we are
on the same page.  I think we are just saying the same thing
in different words.  When an object is to be allocated,
its pointer will be set by the allocator.  This pointer
is 'robject.pgarbage' and is found in 'jvm/src/object.h'.
The reason you did not find one is because I have only
provided a stub GC implementation.  The 'new' operation
is performed by object_instance_new() in 'jvm/src/object.c'
and includes a call to GC_OBJECT_NEW().

Ok, so now I understand what the 'robject.pgarbage' pointer is - this is the pointer to the body of an object. So do I understand correctly that the body of an array is pointed to by a different pointer in an object header ? Do you realise that this means that a GETFIELD requires two indirections ?

Do you realise that your object header is somewhere in the vicinity of 40 bytes ? The average payload of a java object (at least for SPECjvm98) is 16 bytes. Many of the fields in the bootJVM object header are actually constant across all objects of a class.

The object table needs to go - I think this is a fundamental enough design feature that it needs to be removed as soon as possible. An object header should be 2 words at the most (3 for arrays), and they should be contiguous with the object.

Any time a reference to that object is made, its object
hash, of type 'jvm_object_hash' has a reference recorded
by GC_OBJECT_MKREF_FROM_OBJECT() or GC_OBJECT_FIELD_MKREF()
or GC_OBJECT_MKREF_FROM_JVM().  These are for internal
JVM references, references from Java object reference variables,
and references from Java local method reference variables,
respectively.  When the reference is no longer needed, such as
when an object is destroyed or when a local method returns,
the reference is no longer needed.  When this occurs,
then GC_OBJECT_RMREF_FROM_OBJECT() or GC_OBJECT_FIELD_RMREF()
or GC_OBJECT_RMREF_FROM_JVM() are called, respectively.  When
an object is no longer used, then GC_OBJECT_DELETE() is called.

So these I guess are what we would refer to in the GC world as barriers. My confusion is that a) they don't have enough parameters, and b) there are too many of them. Am I right in assuming that you are invoking a barrier on every update to a local variable, every update to a stack ? This is going to be way too inefficient.

For a stop the world GC, what is needed is a barrier on a PUTFIELD and AASTORE, and a way that the GC can query the current pointers that the VM has into the heap when it starts a collection. Optionally, we can also have a PUTSTATIC barrier, and corresponding read barriers.

I also don't get the distinction between the classes and objects. Either a class is an object, and lives in the heap - in which case it can be treated as an object - or it isn't and lives in the VM.

When a local methods is called, GC_STACK_NEW() is invoked to set
up GC for that stack frame.  Adding and removing objects is done
per above.  Adding and removing references to objects is done
with GC_STACK_MKREF_FROM_JVM() and GC_STACK_RMREF_FROM_JVM().
When it returns, GC_STACK_DELETE() is called.

This is way too heavyweight. Stack manipulation is fast-path code, and shouldn't have to call out to the GC all the time.

- There are many styles of collector, but the most common class uses
tracing, in which a root set of pointers is used to determine an initial
set of live objects, and the collector performs a transitive closure over
this set to establish the set of all live objects.  The root set is
commonly the thread stacks and the static pointer fields.


The macro GC_RUN() refers to the collector.  The stub implementation
shows what it is supposed to do.  The OBJECT_STATUS_GCREQ status bit
controls when an object needs collecting.

So where does OBJECT_STATUS_GCREQ get set ? It's the GC (and only the GC) that knows when an object is alive or not (unless we do reachability analysis in the compiler/interpreter).

All of the above setting up and tearing down of objects and references
to objects have equivalents for classes using the same rationale.
(The class GC pointer is 'rclass.pgarbage' in 'jvm/src/class.h'.
The local method GC pointer is 'JVMREG_STACK_GC_OFFSET' in
'jvm/src/jvmreg.h' and is manipulated by GC_STACK_NEW() and
GC_STACK_DELETE().)

- The above is also complicated by
 . Reference types
 . Finalization
 . Locks
 . Address-based hashing
 The solutions to these are all pretty well known, but complicate the

design

This is pretty much it - the rest (45 years of research) is optimizing the
way this is all done.

         Perhaps I am taking it a bit too simplistically,
but all I did was define a GC hook anywhere an object or
array reference or a class load or unload happened.  How
does your concept differ?  And what sort of approach should
be used instead.

This I think brings us back to my initial question, asking what these
hooks were supposed to do.  I guess you're saying you had a vague idea
that the GC might need to know about these events, so put hooks in for
them.

When I was looking around the code trying to find out where to start
hooking in a managed heap, I looked for the 'new' or 'alloc' operation,
and couldn't seem to find it.


'new' --  see object_instance_new() for objects,
         class_static_new() for classes.

'delete' -- see object_instance_delete() for objects,
           class_static_delete() for classes.

Again, what is it that calls 'delete' - isn't that the GC's job ? I'm still confused.

All four of these functions perform a number of GC activities,
espcially calling the GC_xxx() macros listed above.

Notice that, throughout the body of the code, _any_ time a reference
to an object (or class) is created or destroyed, one or more related
GC functions are called.

What is _done_ with these events is up to the implementor
of the GC algorithm itself.

The GC and the VM need to be in bed a bit more closely than this, at least for an implementation with reasonable performance.

Attached is a tarball with 3 header files, and two (untested) GC implementations - unfortunately there's too much work for me to refactor the VM to match the interface and therefore to test. It's probably best treated as example code for now.


cheers
Robin

gc.tgz
Description: application/compressed-tar

Re: Questions about GC implementations

Reply via email to