> Or you give up on any technique that requires a barrier. Which means
> giving up deferred reference counting.
URC is a different way of doing it. A write barrier which is rarely active is OK.

> After writing my clarification about by-ref parameters, I realized that we
> need to do write barriers on those as well, and those write barriers won't
> be cheap, because they potentially have to find the object root. It's
> beginning to smell like REF 'vt types only work because they aren't used
> all that often.

Why do we need write barriers on by-ref params which never escape the stack
(and likely never even escape registers)? There is never a returned value. I
suppose this gets into objects that are pointed to from roots and whether
they can be moved.

>> But that's my point on boxing: it has little to do with not taking an
>> internal reference. You need a box for:
>> - "Everything is an Object", so value types are cast as objects; V1 did
>>   not have generics.
>> - Casts to interfaces. Probably a bad idea to cast value types as
>>   interfaces, and certainly not needed with type classes.
>> - Some rarer cases.
>>
>> I'm not convinced by "everything is an object". It was needed for
>> collections in V1, but again it's not needed with type classes. We can
>> handle internal references to an array via slices.

> Do you mean here that "everything is a descendant of object"? If so, I
> agree that was a mistake. You want a "bottom" in the type system, but I
> don't think Object was the right choice.

With subtyping you use "everything is an object" (the "is" operator). I
think you want bottoms, i.e. ValueObject and ReferenceObject, but a unified
bottom seems problematic.

>>> Java lacks the expressiveness to describe the data structures.

>> But we are not talking about just using it for system structures; as has
>> been noted, C# fixed and reference-free structs are enough.

> I agree: we're not talking about using it just for system problems. But
> the *ability* to use BitC for systems problems is one of the big
> differences in comparison to other systems.
> Because of this, it's unfortunate that reference-free structs and "fixed"
> *aren't* good enough.
>
> Fixed is unacceptable because it is unsafe. Reference-free structs are not
> good enough for writing an operating system or other low-level systems
> code. Go have a look at the kinds of data structures we're talking about
> here:
>
> http://dev.eros-os.com/hg/coyotos/trunk/file/tip/src/sys/kerninc
>
> The Process and GPT structures are good ones to examine. Also how these
> relate to ObjectHeader.

Fixed does not need to be unsafe; it is only unsafe because of interior
references, and a fat pointer is a simple and effective solution with a low
cost, especially if masked. We are not talking about reference-free structs
(C# arrays have them), only arrays that hold either references or
reference-free value types. Actually I can't see a clear issue with holding
references here; the issue was that the mark needs to be aware of this, and
the precise-collection vector would need a loop, but this is not needed. You
can calculate a standard bitmap and set the entries, as long as the array is
fixed size and there are no cycles.

In C kernels you don't really have a heap, and yes, you have pointers to
statically allocated structures (OK, there is a kalloc, but it's not like a
heap and you can't use moving objects). Sorry if this is obvious, but it's
for my own benefit to help clear things up. OK, what you want is the linked
list / tree as (static?) value types, not reference types, but this can't be
allowed since it leads to a non-fixed-size object. Hence most linked lists /
trees need to either use an index or use a reference type as the node.
(Which is why slice works well, as it uses an index, but it has problems
with lock-free algos.) You realize this but want to add an internal
reference to a value type.
So what you want is:

  struct Link {
    struct Link *next;
    struct Link *prev;
  };

(http://dev.eros-os.com/hg/coyotos/trunk/file/eec49128e266/src/sys/kerninc/Link.h#l35)

And have something like:

  class myObj {
    Link[128] links;
    *Link firstNode;   // Note: not Link firstNode
  }

First issue: we need to distinguish Link (an embedded value link) from Link*
(a reference). Clearly this is no longer a value type; it's either a pointer
(unsafe) or a reference type. We had already concluded it's a different type.

So why don't we want to use a heap reference type? I don't entirely buy
locality: the whole purpose of nursery bump allocation (TLA) is that the
objects are local; that's your higher L1 cache hit rate (vs. a size-ordered
heap :-). In some cases the separation will be higher, but equally, if you
have large arrays in objects, then other parts of the object could suffer
much worse locality. So the internal arrays are of marginal benefit here.

We don't want to use a heap object because of the header overhead on each
node, but if you add the header to each embedded node then it's no different
from your 1-byte header for heap objects. If there were no header (as per
some objects in the paper I sent, which need no header in the nursery and
can get the vtable from virtual memory) then storage and access of a vector
would be almost the same as C; you would have almost the same layout, just
the separation of the header. E.g.:

  class myObj {
    Vector<Link>[128] links;
    // firstNode = links[0]; can be separate
  }

If Link is a reference type then we only have the locality cost imposed by
the header (which may be 0), except for the first access, since afterwards
access is via prev or next.
Note here that we could potentially store these high-performance objects in
a special block which uses virtual memory to find the vtable. Again, value
types are not the right tool for this; they appear that way because they are
embedded, but they are useful for small copyable objects like points, byte
buffers, and native representation. So it comes down to embedded objects vs.
heap objects, and if the header can be made small enough then embedded
objects serve little purpose.

> In any case, even if the structs are reference-free, we still need the 'M'
> bit to efficiently support deferred reference counting.

>> We are adding this to increase the expressiveness of the language.

> That's one view. The other view is that we are trying to address this in
> order to bring the expressiveness of Java and C# more in line with that of
> C.

>> And there is a lot of overhead. It was already pointed out that every
>> header word in 32-bit added 3%; then you have the other costs (like GC
>> tracking, the looping type-reference lookup vector), etc.

> It's starting to look like I can do the GC object header in a single byte.

Which has a high chance of having little cost, due to alignment.

>>> Unfortunately, the data says it isn't. The problem is that 10% of
>>> objects do not die young. That's true in the second generation too. So
>>> in a given 32KB block you might have as many as 1K objects, 100 of
>>> which don't die very fast. That block can be recycled, but it won't
>>> become free any time soon. So what happens is that you end up
>>> constantly growing the heap to generate free blocks.

>> True, but there are limits; e.g. partially filled blocks are reused. It
>> would be nice to get a handle on this: is the heap cost for the extra
>> free blocks needed to cover the fragmentation 20% more (acceptable) or
>> 100% (not acceptable)?

> That's easy to get.
> Just count the number of total block allocations and the number of total
> blocks returned to the free block list (as opposed to the recycled block
> list). The expectation is that *every* block that was returned to the free
> block list would instead have required heap growth.
>
> Or just turn off relocation and check max heap size. Wasn't there
> something similar in the RC-Immix paper?

Yes, there is no relocation until fragmentation requires it.

>>> Or you can size-partition the heap to avoid the fragmentation. But that
>>> costs you a 28% increase in L1 cache misses and likely a bunch more in
>>> TLB misses.

>> As per URC. It's a cost but bearable (based on comparison to MMTk).
>> malloc normally goes to a size-partitioned heap (SLAB/SLUB), unless that
>> has changed again. And you can get something back, as the vtable field
>> can be avoided for many objects.

> I'm not sure what size partitioning has to do with the vtable field. But
> that 28% is *in addition to* all the other costs that are going on. That
> 28% is what's paying for the write barrier costs. You can't afford to
> lose it in a credible system.

I'm not against the write barrier; this write barrier is far cheaper than a
card table, as most of the time it does nothing. Re size partitioning: you
can partition using VM by type (a subdivision of size) for some objects,
removing the vtable.

> The issue is this:
>
> - An inline read barrier is incurred on every object whose fields are
>   sourced.
> - A page-faulting read barrier is only incurred on objects that are
>   (a) being forwarded, (b) are dynamically referenced during the window of
>   time when forwarding has not completed, and (c) have the property that
>   their references are getting copied around as forwarding is underway.
>
> If you have access to page reference aging data, you can tune your
> evacuation policy to under-referenced blocks. But no matter what you do,
> you are *sometimes* going to incur overhead.
> The page fault analysis doesn't have one answer. Differences in page
> fault handling performance on different OS's can be a factor of 10 apart,
> and there are some approaches available now that haven't even been tried
> (or at least there's no evidence in the literature).

>>>> Isn't a simple but short stop-the-world almost as effective, with the
>>>> nursery swept in parallel? Then you only need a write barrier for heap
>>>> object references to the nursery, which is pretty cheap. Possibly you
>>>> can remove most of the nursery pauses by using a write barrier and
>>>> relying on thread-local nurseries (we discussed some mechanisms).

>>> It isn't, because the nursery isn't the source of the problem here.
>>> Forwarding within the nursery can all be done during nursery sweep (as
>>> you suggest). The problem is that for a large heap you need to do
>>> concurrent collection in the background, which includes forwarding
>>> objects to render low-occupancy blocks free. Thus the read barrier,
>>> because the mutator is running concurrently.

>> Ref counting can take care of freeing most objects, so we are talking
>> about moving a few objects. A read barrier just feels wrong.
>>
>> Why can't we:
>> - Identify some blocks with low usage, up to a limit.
>> - Do a concurrent mark for the main heap, recording objects that refer
>>   to these selected objects.
>> - Sync with the nursery, so that before the nursery is swept we identify
>>   pointers to these objects.
>> - Use a mod-buf write barrier between the mark and the nursery sweep. If
>>   the mark hasn't run and there is no ref to the nursery, it just
>>   returns.
>> - Lock the world and mark the remainder of the nursery in parallel.
>>   Complete the stack scan (if doing incremental), scan the mod buf, and
>>   add to the reference list. Move the objects, update references to
>>   those objects, sweep the nursery. This should be a very small pause,
>>   maybe only adding 20% to normal nursery sweep time.
>>> A compromise would be for the GC to identify blocks that will be
>>> evacuated at the start of collection, accumulate a points-to map during
>>> tracing, pause all the mutators, forward the objects, and explicitly
>>> back-patch the points-to locations. Most of the objects will forward,
>>> and the reference counts will usually confirm that we got all of the
>>> in-bound pointers. When we didn't, we institute a heavy (concurrent)
>>> read barrier with the page fault system. This approach works for a
>>> small number of active threads. The problem with it is that it really
>>> doesn't scale.

>> That's similar to what I just wrote. Why does it not scale?

> Because "lock the world" doesn't scale.
>
> There is a long list of things you can do to reduce the "lock the world"
> interval, but the root problem is that you are locking the world at all.
> In an application having tens of threads, locking the world - just the
> locking, not actually doing anything here - can *easily* take >100ms.
> Consider that some of those threads have been context switched...

120ms pauses are fine in 95% of apps. Build a better collector for the rest
(which is basically real time and games), or just remove the nursery in v1.
I'm not convinced it doesn't scale:

- The pause cost is per thread; as you add more threads, the cost as a
  percentage of total execution time remains almost fixed. (You could argue
  the total amount wasted is higher.)
- The size of the nursery is bounded, and hence the pause is independent of
  the heap size. You can have no nursery, giving low stops. (I reread the
  Immix paper and I'm convinced Immix doesn't have a nursery. I'm not sure
  about RC-Immix, but I think it's the same: they just copy the new objects
  more, i.e. they just have a local allocator which allocates main heap
  blocks.)
- Context switches can be massively reduced by using idle threads; this
  will become more important when we move to 10-20 cores.
- I do see some issue with more threads filling a nursery more often
  (without resorting to separate nurseries that need to be synced), but
  this should be balanced by more threads to do the parallel mark.

> Ben, you keep talking about a v1 implementation and moving forward. I can
> understand the eagerness to do that, but it's a mistake. We're still
> identifying issues that limit or influence our choices about
> language-level semantics. We don't necessarily need to *build* the
> ultimate GC first time through, but we *do* need to avoid boxing ourself
> into a corner on language-level semantics.

It's partly habit; I like to get things out there, since a crappy product
is better than no product (look at JavaScript, it's actually coming
together now), and then iterate. I also tend to read a bit and then build,
and as I build I reread more thoroughly, and with benchmarks my knowledge
becomes significantly higher. I do appreciate the detailed analysis you're
doing; the discussions on interior references and RC-Immix have been very
useful. But I'm not sure about the concurrent GC options; a new algorithm
may come along which requires a different language structure. There is very
little research on 10-20+ core garbage collection, but it will come. That
said, I can't stop myself looking for a v1 implementation that will lead to
a good v2 (even if not the best v2 option).

Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
