> Or you give up on any technique that requires a barrier. Which means
> giving up deferred reference counting.
URC is a different way of doing it. A write barrier which is rarely active is OK.

> After writing my clarification about by-ref parameters, I realized that we
> need to do write barriers on those as well, and those write barriers won't
> be cheap, because they potentially have to find the object root. It's
> beginning to smell like REF 'vt types only work because they aren't used
> all that often.

Why do we need write barriers on by-ref params which never escape the stack
(and likely never even escape registers)? There is never a returned value. I
suppose this gets into objects that are pointed to from roots and whether
they can be moved.

>> But that's my point on boxing: it has little to do with not taking an
>> internal reference. You need a box for:
>> - "Everything is an Object", so value types are cast as objects; V1 did
>>   not have generics.
>> - Casts to interfaces. Probably a bad idea to cast value types as
>>   interfaces, and certainly not needed with type classes.
>> - Some rarer cases.
>>
>> I'm not convinced by "everything is an object". It was needed for
>> collections in V1, but again it's not needed with type classes. We can
>> handle internal references to an array via slices.

> Do you mean here that "everything is a descendant of object"? If so, I
> agree that was a mistake. You want a "bottom" in the type system, but I
> don't think Object was the right choice.

With subtyping you use "everything is an object" (the "is" operator). I
think you want bottoms, i.e. ValueObject and ReferenceObject, but a unified
bottom seems problematic.

>>> Java lacks the expressiveness to describe the data structures.

>> But we are not talking about just using it for system structures; as has
>> been noted, C# fixed and reference-free structs are enough.

> I agree: we're not talking about using it just for system problems. But
> the *ability* to use BitC for systems problems is one of the big
> differences in comparison to other systems.
> Because of this, it's unfortunate that reference-free structs and "fixed"
> *aren't* good enough.
>
> Fixed is unacceptable because it is unsafe. Reference-free structs are not
> good enough for writing an operating system or other low-level systems
> code. Go have a look at the kinds of data structures we're talking about
> here:
>
> http://dev.eros-os.com/hg/coyotos/trunk/file/tip/src/sys/kerninc
>
> The Process and GPT structures are good ones to examine. Also how these
> relate to ObjectHeader.

Fixed does not need to be unsafe; it is only unsafe because of interior
references, and a fat pointer is a simple and effective solution with a low
cost, especially if masked. We are not talking about reference-free structs
(C# arrays have them), only arrays that hold either references or
reference-free value types. Actually I can't see a clear issue with holding
references here; the issue was that the mark needs to be aware of this, and
the precise-collection vector would need a loop, but this is not needed. You
can calculate a standard bitmap and set the entries, as long as the array is
fixed size and there are no cycles.

In C kernels you don't really have a heap, and yes, you have pointers to
statically allocated structures (OK, there is a kalloc, but it's not like a
heap and you can't use moving objects). Sorry if this is obvious, but it's
for my own benefit to help clear things up. OK, what you want is the linked
list / tree as (static?) value types, not reference types, but this can't be
allowed since it leads to a non-fixed-size object. Hence most linked lists /
trees need to either use an index or use a reference type as the node.
(Which is why slice works well, as it uses an index, but it has problems
with lock-free algos.) You realize this but want to add an internal
reference to a value type.
So what you want is:

  struct Link {
    struct Link *next;
    struct Link *prev;
  };

(http://dev.eros-os.com/hg/coyotos/trunk/file/eec49128e266/src/sys/kerninc/Link.h#l35)

And have something like:

  class myObj {
    Link[128] links;
    *Link firstNode;   // Note: not Link firstNode
  }

First issue: we need to distinguish Link (an embedded value link) from Link*
(a reference). Clearly this is no longer a value type; it's either a pointer
(unsafe) or a reference type. We had already concluded it's a different type.

So why don't we want to use a heap reference type? I don't entirely buy
locality: the whole purpose of nursery bump allocation (TLA) is that the
objects are local; that's your higher L1 cache hit rate (vs. a size-ordered
heap :-). In some cases the separation will be higher, but equally, if you
have large arrays in objects, then other parts of the object could suffer
much worse locality. So the internal arrays are of marginal benefit here.

We don't want to use a heap object because of the header overhead on each
node, but if you add the header to each embedded node then it's no different
from your 1-byte header for heap objects. If there were no header (as per
some objects in the paper I sent, which need no header in the nursery and
can get the vtable from virtual memory) then storage and access of a vector
would be almost the same as C; you would have almost the same layout, just
the separation of the header. E.g.:

  class myObj {
    Vector<Link>[128] links;
    // firstNode = links[0]; can be separate
  }

If Link is a reference type then we only have the locality cost imposed by
the header (which may be 0), except for the first access, since afterwards
access is via prev or next.
Note here that we could potentially store these high-performance objects in
a special block which uses virtual memory to find the vtable. Again, value
types are not the right tool for this; they appear that way because they are
embedded, but they are useful for small copyable objects like points, byte
buffers, and native representation. So it comes down to embedded objects vs.
heap objects, and if the header can be made small enough then embedded
objects serve little purpose.

> In any case, even if the structs are reference-free, we still need the 'M'
> bit to efficiently support deferred reference counting.

>> We are adding this to increase the expressiveness of the language.

> That's one view. The other view is that we are trying to address this in
> order to bring the expressiveness of Java and C# more in line with that of
> C.

>> And there is a lot of overhead. It was already pointed out that every
>> header word in 32-bit added 3%; then you have the other costs (like GC
>> tracking, the looping type-reference lookup vector), etc.

> It's starting to look like I can do the GC object header in a single byte.

Which has a high chance of having little cost, due to alignment.

>>> Unfortunately, the data says it isn't. The problem is that 10% of
>>> objects do not die young. That's true in the second generation too. So
>>> in a given 32KB block you might have as many as 1K objects, 100 of
>>> which don't die very fast. That block can be recycled, but it won't
>>> become free any time soon. So what happens is that you end up
>>> constantly growing the heap to generate free blocks.

>> True, but there are limits; e.g. partially filled blocks are reused. It
>> would be nice to get a handle on this: is the heap cost for the extra
>> free blocks needed to cover the fragmentation 20% more (acceptable) or
>> 100% (not acceptable)?

> That's easy to get.
> Just count the number of total block allocations and the number of total
> blocks returned to the free block list (as opposed to the recycled block
> list). The expectation is that *every* block that was returned to the free
> block list would instead have required heap growth.
>
> Or just turn off relocation and check max heap size. Wasn't there
> something similar in the RC-Immix paper?

Yes, there is no relocation until fragmentation requires it.

>>> Or you can size-partition the heap to avoid the fragmentation. But that
>>> costs you a 28% increase in L1 cache misses and likely a bunch more in
>>> TLB misses.

>> As per URC. It's a cost but bearable (based on comparison to MMTk).
>> malloc normally goes to a size-partitioned heap (SLAB/SLUB), unless that
>> has changed again. And you can get something back, as the vtable field
>> can be avoided for many objects.

> I'm not sure what size partitioning has to do with the vtable field. But
> that 28% is *in addition to* all the other costs that are going on. That
> 28% is what's paying for the write barrier costs. You can't afford to
> lose it in a credible system.

I'm not against the write barrier; this write barrier is far cheaper than a
card table, as most of the time it does nothing. Re size partitioning: you
can partition using VM by type (a subdivision of size) for some objects,
removing the vtable.

> The issue is this:
>
> - An inline read barrier is incurred on every object whose fields are
>   sourced.
> - A page-faulting read barrier is only incurred on objects that are
>   (a) being forwarded, (b) are dynamically referenced during the window of
>   time when forwarding has not completed, and (c) have the property that
>   their references are getting copied around as forwarding is underway.
>
> If you have access to page reference aging data, you can tune your
> evacuation policy to under-referenced blocks. But no matter what you do,
> you are *sometimes* going to incur overhead.
> The page fault analysis doesn't have one answer. Differences in page
> fault handling performance on different OS's can be a factor of 10 apart,
> and there are some approaches available now that haven't even been tried
> (or at least there's no evidence in the literature).

>>>> Isn't a simple but short stop-the-world almost as effective, with the
>>>> nursery swept in parallel? Then you only need a write barrier for heap
>>>> object references to the nursery, which is pretty cheap. Possibly you
>>>> can remove most of the nursery pauses by using a write barrier and
>>>> relying on thread-local nurseries (we discussed some mechanisms).

>>> It isn't, because the nursery isn't the source of the problem here.
>>> Forwarding within the nursery can all be done during nursery sweep (as
>>> you suggest). The problem is that for a large heap you need to do
>>> concurrent collection in the background, which includes forwarding
>>> objects to render low-occupancy blocks free. Thus the read barrier,
>>> because the mutator is running concurrently.

>> Ref counting can take care of freeing most objects, so we are talking
>> about moving a few objects. A read barrier just feels wrong.
>>
>> Why can't we:
>> - Identify some blocks with low usage, up to a limit.
>> - Do a concurrent mark for the main heap, recording objects that refer
>>   to these selected objects.
>> - Sync with the nursery, so that before the nursery is swept we identify
>>   pointers to these objects.
>> - Use a mod-buf write barrier between the mark and the nursery sweep. If
>>   the mark hasn't run and there is no ref to the nursery, it just
>>   returns.
>> - Lock the world and mark the remainder of the nursery in parallel.
>>   Complete the stack scan (if doing incremental), scan the mod buf, and
>>   add to the reference list. Move the objects, update references to
>>   those objects, sweep the nursery. This should be a very small pause,
>>   maybe only adding 20% to normal nursery sweep time.
>>> A compromise would be for the GC to identify blocks that will be
>>> evacuated at the start of collection, accumulate a points-to map during
>>> tracing, pause all the mutators, forward the objects, and explicitly
>>> back-patch the points-to locations. Most of the objects will forward,
>>> and the reference counts will usually confirm that we got all of the
>>> in-bound pointers. When we didn't, we institute a heavy (concurrent)
>>> read barrier with the page fault system. This approach works for a
>>> small number of active threads. The problem with it is that it really
>>> doesn't scale.

>> That's similar to what I just wrote. Why does it not scale?

> Because "lock the world" doesn't scale.
>
> There is a long list of things you can do to reduce the "lock the world"
> interval, but the root problem is that you are locking the world at all.
> In an application having tens of threads, locking the world - just the
> locking, not actually doing anything here - can *easily* take >100ms.
> Consider that some of those threads have been context switched...

120ms pauses are fine in 95% of apps. Build a better collector for the rest
(which is basically real time and games), or just remove the nursery in v1.
I'm not convinced it doesn't scale:

- The pause cost is per thread; as you add more threads, the cost as a
  percentage of total execution time remains almost fixed. (You could argue
  the total amount wasted is higher.)
- The size of the nursery is bounded, and hence the pause is independent of
  the heap size. You can have no nursery, giving low stops. (I reread the
  Immix paper and I'm convinced Immix doesn't have a nursery. I'm not sure
  about RC-Immix, but I think it's the same: they just copy the new objects
  more, i.e. they just have a local allocator which allocates main heap
  blocks.)
- Context switches can be massively reduced by using idle threads; this
  will become more important when we move to 10-20 cores.
- I do see some issue with more threads filling a nursery more often
  (without resorting to separate nurseries that need to be synced), but
  this should be balanced by more threads to do the parallel mark.

> Ben, you keep talking about a v1 implementation and moving forward. I can
> understand the eagerness to do that, but it's a mistake. We're still
> identifying issues that limit or influence our choices about
> language-level semantics. We don't necessarily need to *build* the
> ultimate GC first time through, but we *do* need to avoid boxing ourself
> into a corner on language-level semantics.

It's partly habit; I like to get things out there, since a crappy product
is better than no product (look at JavaScript, it's actually coming
together now), and then iterate. I also tend to read a bit and then build,
and as I build I reread more thoroughly, and with benchmarks my knowledge
becomes significantly higher. I do appreciate the detailed analysis you're
doing; the discussions on interior references and RC-Immix have been very
useful. But I'm not sure about the concurrent GC options; a new algorithm
may come along which requires a different language structure. There is very
little research on 10-20+ core garbage collection, but it will come. That
said, I can't stop myself looking for a v1 implementation that will lead to
a good v2 (even if not the best v2 option).

Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
