On Thu, Oct 31, 2013 at 9:35 PM, Ben Kloosterman <[email protected]> wrote:

>
>>
>
>> So either *all* objects (interior and exterior) have a GC header at
>> obref[-n], or the write barrier can't examine the object header. And that's
>> mostly fine. Except that deferred reference counting needs the 'M' bit, and
>> handling forwarding needs the 'F' bit, so I'm not convinced that getting
>> rid of the header is an option. If we declare that *all* objects have a
>> vTable slot then I guess we could use the low-order bits, but that entails
>> layout issues too, because not all objects have vTable pointers. Notably:
>> CLR value types do not carry a /this/ pointer.
>>
>
> Which means you lose the 1:1 native rep. So you have to either have a
> separate base type that can't take an internal reference, or do more fancy
> marshalling? Both have a high price.
>

Or you give up on any technique that requires a barrier. Which means giving
up deferred reference counting.

After writing my clarification about by-ref parameters, I realized that we
need to do write barriers on those as well, and those write barriers won't
be cheap, because they potentially have to find the object root. It's
beginning to smell like REF 'vt types only work because they aren't used
all that often.


> But that's my point on boxing: it has little to do with not taking an
> internal reference. You need a box for:
> - "Everything is an Object", so value types are cast as objects; v1 did
> not have generics.
> - Casts to interfaces. Probably a bad idea to cast value types as
> interfaces, and certainly not needed with type classes.
> - There are some rarer cases.
>
> I'm not convinced by "everything is an object"; this was needed for
> collections in v1, but again it's not needed with type classes. We can
> handle internal references to an array via slices.
>

Do you mean here that "everything is a descendant of Object"? If so, I
agree that was a mistake. You want a "bottom" in the type system, but I
don't think Object was the right choice.

>> If we're trying to avoid the header on interior unboxed objects, we don't
>> need it for finding the base object. Note that objects are a minimum of 4
>> words (when alignment requirements are considered). On a 32-bit system
>> that's 16 bytes, in which case a 128-byte immix line holds 8 objects. How
>> convenient that there are 8 bits in the per-line metadata byte. Instead of
>> using an object count, we can use a bit to mark which 4-word boundaries
>> correspond to the start of an object. Given such bits, we can walk
>> backwards fairly cheaply to find containing objects.
>>
>
> That seems a better strategy
>

I think it's a better strategy for finding the containing object, but it
doesn't address the need to have bits for the write barrier for deferred RC.
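The line-metadata trick is easy to make concrete. Here is a minimal C sketch
using the figures from the discussion (128-byte lines, 16-byte / 4-word
granularity on 32-bit, 256 lines in a 32KB block); all names and the layout
are illustrative, not from any actual implementation:

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed layout constants from the discussion: 128-byte immix lines,
 * 16-byte object granularity, so each line's metadata byte holds one
 * "an object starts in this granule" bit per granule. */
#define LINE_SIZE         128u
#define GRANULE_SIZE      16u
#define GRANULES_PER_LINE (LINE_SIZE / GRANULE_SIZE)   /* 8 */

/* One metadata byte per line for a 32KB block (256 lines). */
typedef struct {
    uintptr_t base;           /* start of the block's data area */
    uint8_t   line_bits[256]; /* bit g set => granule g starts an object */
} block_meta;

/* Walk backwards from an interior pointer to the granule that starts
 * the containing object.  Returns the object's base address, or 0 if
 * no start bit is found (which would indicate heap corruption). */
uintptr_t find_object_start(const block_meta *bm, uintptr_t interior)
{
    size_t granule = (interior - bm->base) / GRANULE_SIZE;
    for (;;) {
        size_t line = granule / GRANULES_PER_LINE;
        size_t bit  = granule % GRANULES_PER_LINE;
        if (bm->line_bits[line] & (1u << bit))
            return bm->base + granule * GRANULE_SIZE;
        if (granule == 0)
            return 0;
        granule--;
    }
}
```

The backward walk is bounded by object size, and in the common case it
terminates within the current metadata byte.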

>>> Java has no value types at all and is pretty fast; it's not needed, just
>>> an optimization.
>>>
>>
>> It is NOT just an optimization. The absence of general unboxed types is a
>> fundamental expressiveness limitation. You simply cannot write systems code
>> in any sensible way without it. I *challenge* anyone to re-code the
>> Coyotos kernel with comparable efficiency in Java. You get to tell the
>> compiler that no collector is required, so that's a free 25%-30% advantage
>> over conventional Java right there.
>>
>> It. Can't. Be. Done.
>>
>> Java lacks the expressiveness to describe the data structures.
>>
>
> But we are not talking about just using it for system structures. As has
> been noted, C# fixed and reference-less structs are enough.
>

I agree: we're not talking about using it just for system problems. But the
*ability* to use BitC for systems problems is one of the big differences in
comparison to other systems. Because of this, it's unfortunate that
reference-free structs and "fixed" *aren't* good enough.

Fixed is unacceptable because it is unsafe. Reference-free structs are not
good enough for writing an operating system or other low-level systems
code. Go have a look at the kinds of data structures we're talking about
here:

http://dev.eros-os.com/hg/coyotos/trunk/file/tip/src/sys/kerninc


The Process and GPT structures are good ones to examine. Also how these
relate to ObjectHeader.

In any case, even if the structs are reference-free, we still need the 'M'
bit to efficiently support deferred reference counting.
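For what it's worth, here is roughly the shape of barrier that needs that
bit. This is a generic coalescing-style deferred-RC logging barrier (in the
Levanoni/Petrank style), not anything from an actual BitC implementation,
and all names are made up; the assumption is that 'M' means "already logged
this cycle":

```c
#include <stdint.h>
#include <stddef.h>

#define HDR_M_BIT 0x01u          /* 'M': object already in the mod-buffer */

typedef struct object {
    uint8_t gc_header;           /* the single-byte header discussed below */
    struct object *field;        /* one reference field, for illustration */
} object;

#define MODBUF_MAX 1024
static object *modbuf[MODBUF_MAX];
static size_t  modbuf_len = 0;

/* On the first store into an object since the last collection, remember
 * it in the mod-buffer; the collector later diffs the object's old and
 * new pointer fields to adjust reference counts.  The M bit keeps each
 * object from being logged more than once per cycle, which is what makes
 * the fast path one test-and-branch. */
void write_barrier(object *obj, object *newval)
{
    if (!(obj->gc_header & HDR_M_BIT)) {
        obj->gc_header |= HDR_M_BIT;
        if (modbuf_len < MODBUF_MAX)
            modbuf[modbuf_len++] = obj;
    }
    obj->field = newval;
}
```

The collector clears M bits (and drains the buffer) at each cycle; the
point is simply that without a reachable header bit there is nowhere cheap
to record "already logged".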

> We are adding this to increase the expressiveness of the language.
>

That's one view. The other view is that we are trying to address this in
order to bring the expressiveness of Java and C# more in line with that of
C.

> And there is a lot of overhead: it was already pointed out that every
> header word in 32-bit added 3%, and then you have the other costs (like GC
> tracking, looping type reference lookup vector), etc.
>

It's starting to look like I can do the GC object header in a single byte.

>> Though it should be said that compare-and-swap on a single thread actually
>> isn't that expensive.
>>
>
> In low contention it's cheap, which should be the case here, as the
> collector rarely runs.
>

A single thread is zero contention. In that case CMPXCHG is cache-local with
no broadcast invalidate (which is why the contended version is expensive).
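For concreteness, the operation in question is just a C11
compare-exchange. On x86 this compiles to LOCK CMPXCHG; when one core
already owns the cache line exclusively (the zero-contention case), the
lock prefix generates no cross-core invalidation traffic:

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative GC status word; names are made up for the sketch. */
static _Atomic uint32_t gc_word;

/* Attempt to transition gc_word from 'expected' to 'desired'.
 * Returns true on success.  'expected' is updated in place on
 * failure by the C11 primitive; we discard that here. */
bool try_claim(uint32_t expected, uint32_t desired)
{
    return atomic_compare_exchange_strong(&gc_word, &expected, desired);
}
```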


>> The problem with this approach is that it assumes a per-object lock, which
>> is not how I want things to work.
>>
>
> A lock on the line is nice for the collector, but would be expensive for
> the mutator.
>

That's not what I'm after either. I really want to leave lock relationships
in the hands of the developer.


>> But now that a lot of the cases are handled in the store barrier, I think
>> that the *rest* can be handled with page faults if we really need to.
>>
>
> +1. That looks promising (even if it does mean lots of page changes to
> the kernel). I assume then you could just get the lock (page level) in
> the page fault. And when the collector is not running you're just back to
> the write barrier.
>

It doesn't need a page-level lock at all. And it's not an issue in the
kernel, because a well-structured kernel never releases memory at all.
There's no kernel GC.


>> Unfortunately, the data says it isn't. The problem is that 10% of objects
>> do not die young. That's true in the second generation too. So in a given
>> 32KB block you might have as many as 1K objects, 100 of which don't die
>> very fast. That block can be recycled, but it won't become free any time
>> soon. So what happens is that you end up constantly growing the heap to
>> generate free blocks.
>>
>
> True, but there are limits; e.g. partially filled blocks are reused. It
> would be nice to get a handle on this: is the heap cost for the extra free
> blocks needed to cover the fragmentation 20% more (acceptable) or 100%
> (not acceptable)?
>

That's easy to get. Just count the number of total block allocations and
the number of total blocks returned to the free block list (as opposed to
the recycled block list). The expectation is that *every* block that was
returned to the free block list would instead have required heap growth.

Or just turn off relocation and check max heap size. Wasn't there something
similar in the RC-immix paper?
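The counting suggested above amounts to a couple of counters in the sweep
path. A sketch, with illustrative names (the classification into free vs.
recycled is the one the discussion already assumes):

```c
#include <stdint.h>

/* Count how often a swept block goes back to the free list (fully
 * empty) versus the recycled list (partially occupied).  Every block
 * that comes back to the free list is a block that would otherwise
 * have required heap growth. */
typedef struct {
    uint64_t total_allocated;   /* blocks handed out to allocators */
    uint64_t returned_free;     /* swept blocks that came back empty */
    uint64_t returned_recycled; /* swept blocks with survivors */
} block_stats;

/* Called from the sweep when a block is classified. */
void account_swept_block(block_stats *s, unsigned live_objects)
{
    if (live_objects == 0)
        s->returned_free++;
    else
        s->returned_recycled++;
}

/* Rough overhead estimate: fraction of block allocations that could
 * not have been satisfied by a reclaimed-free block, i.e. that would
 * have forced heap growth. */
double growth_fraction(const block_stats *s)
{
    if (s->total_allocated == 0)
        return 0.0;
    return (double)(s->total_allocated - s->returned_free)
         / (double)s->total_allocated;
}
```

Comparing that fraction against a 20% vs. 100% heap-growth budget would
answer the question posed above.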


>
>
>> Or you can size-partition the heap to avoid the fragmentation. But that
>> costs you a 28% increase in L1 cache misses and likely a bunch more in TLB
>> misses.
>>
>
> As per URC. It's a cost, but bearable (based on comparison to MMTk).
> malloc normally goes to a size-partitioned heap (SLAB/SLUB), unless that
> has changed again. And you can get something back, as the vtable field can
> be avoided for many objects.
>

I'm not sure what size partitioning has to do with the vTable field. But
that 28% is *in addition to* all the other costs that are going on. That
28% is what's paying for the write barrier costs. You can't afford to lose
it in a credible system.

> I do think we need to get a handle on the cost of the read barrier/page
> faults vs. the 28% increase in L1 cache misses (+TLB) for non-nursery
> objects at varying nursery sizes. And it's hard to do that without
> building it... sigh.
>

Yeah, and it's the wrong comparison to boot. First, the page fault solution
*is* a read barrier, so it's not a dichotomy sort of comparison. The issue
is this:

   - An inline read barrier is incurred on every object whose fields are
   sourced.
   - A page-faulting read barrier is only incurred on objects that are (a)
   being forwarded, (b) are dynamically referenced during the window of time
   when forwarding has not completed, and (c) have the property that their
   references are getting copied around as forwarding is underway.

If you have access to page reference aging data, you can tune your
evacuation policy to under-referenced blocks. But no matter what you do,
you are *sometimes* going to incur overhead.

The page fault analysis doesn't have one answer. Differences in page fault
handling performance on different OS's can be a factor of 10 apart, and
there are some approaches available now that haven't even been tried (or at
least there's no evidence in the literature).
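To make the contrast concrete, the inline variant looks roughly like this;
the page-fault scheme removes this check from every load and instead pays
only on faulting accesses to protected from-space pages. The header layout
and names here are illustrative, assuming the 'F' bit discussed earlier
means "has a forwarding pointer":

```c
#include <stdint.h>
#include <stddef.h>

#define HDR_F_BIT 0x02u         /* 'F': object has been forwarded */

typedef struct object {
    uint8_t gc_header;
    struct object *forward;     /* valid only when F is set */
    int payload;                /* stand-in for the object's fields */
} object;

/* The inline read barrier: a test-and-branch paid on *every* object
 * whose fields are sourced, forwarded or not.  This per-load cost is
 * exactly what the page-fault approach avoids in the common case. */
static inline object *read_barrier(object *obj)
{
    if (obj->gc_header & HDR_F_BIT)
        return obj->forward;
    return obj;
}

int load_payload(object *obj)
{
    return read_barrier(obj)->payload;
}
```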


>
>>
>>
>>> Isn't a simple but short stop-the-world almost as effective, and for the
>>> nursery, sweep in parallel? Then you only need a write barrier for
>>> heap-object references to the nursery, which is pretty cheap. Possibly
>>> you can remove most of the nursery pauses by using a write barrier and
>>> relying on thread-local nurseries (we discussed some mechanisms).
>>>
>>
>> It isn't, because the nursery isn't the source of the problem here.
>> Forwarding within the nursery can all be done during nursery sweep (as you
>> suggest). The problem is that for a large heap you need to do concurrent
>> collection in the background, which includes forwarding objects to render
>> low-occupancy blocks free. Thus the read barrier, because the mutator is
>> running concurrently.
>>
>
> Ref counting can take care of freeing most objects, so we are talking
> about moving a few objects. A read barrier just feels wrong.
>
> Why can't we:
> - Identify some blocks with low usage, up to a limit.
> - Do a concurrent mark for the main heap; record objects that refer to
> these selected objects.
> - Sync with the nursery so that, before the nursery is swept, we identify
> pointers to these objects.
> - Use a mod-buf write barrier in between mark and nursery sweep. If mark
> hasn't run and there is no ref to the nursery, then it just returns.
> - Lock the world in parallel; mark the remainder of the nursery.
> Complete the stack scan (if doing incremental), scan the mod-buf, and add
> to the reference list. Move the objects, update references to those
> objects, sweep the nursery. Should be a very small pause; maybe only adds
> 20% to normal nursery sweep time.
>

There is a long list of things you can do to reduce the "lock the world"
interval, but the root problem is that you are locking the world at all. In
an application having tens of threads, locking the world - just the
locking, not actually doing anything here - can *easily* take >100ms.
Consider that some of those threads have been context switched...

>> A compromise would be for the GC to identify blocks that will be evacuated
>> at the start of collection, accumulate a points-to map during tracing,
>> pause all the mutators, forward the objects, and explicitly back-patch the
>> points-to locations. Most of the objects will forward, and the reference
>> counts will usually confirm that we got all of the in-bound pointers. When
>> we didn't, we institute a heavy (concurrent) read barrier with the page
>> fault system. This approach works for a small number of active threads. The
>> problem with it is that it really doesn't scale.
>>
>
> That's similar to what I just wrote. Why does it not scale?
>

Because "lock the world" doesn't scale.


Ben, you keep talking about a v1 implementation and moving forward. I can
understand the eagerness to do that, but it's a mistake. We're still
identifying issues that limit or influence our choices about language-level
semantics. We don't necessarily need to *build* the ultimate GC first time
through, but we *do* need to avoid boxing ourselves into a corner on
language-level semantics.


shap
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
