> So either *all* objects (interior and exterior) have a GC header at obref[-n], or the write barrier can't examine the object header. And that's mostly fine. Except that deferred reference counting needs the 'M' bit, and handling forwarding needs the 'F' bit, so I'm not convinced that getting rid of the header is an option. If we declare that *all* objects have a vTable slot then I guess we could use the low-order bits, but that entails layout issues too, because not all objects have vTable pointers. Notably: CLR value types do not carry a /this/ pointer.
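For concreteness, the kind of header word being discussed might look roughly like this, assuming the 'M' and 'F' bits sit in the two low-order bits of a one-word header (or of a vTable slot). Bit positions and names are purely illustrative, not a settled layout:

    #include <stdint.h>

    /* Hypothetical one-word GC header at obref[-1]. Bit assignments are
     * illustrative only. */
    #define GC_M_BIT    ((uintptr_t)0x1)    /* 'M': live bit for deferred reference counting */
    #define GC_F_BIT    ((uintptr_t)0x2)    /* 'F': object has been forwarded                */
    #define GC_PTR_MASK (~(uintptr_t)0x3)   /* remaining bits: vTable ptr or forwarding addr */

    typedef uintptr_t gc_header;

    static inline int   gc_is_marked(gc_header h)    { return (h & GC_M_BIT) != 0; }
    static inline int   gc_is_forwarded(gc_header h) { return (h & GC_F_BIT) != 0; }
    static inline void *gc_forwarding(gc_header h)   { return (void *)(h & GC_PTR_MASK); }
    static inline void *gc_vtable(gc_header h)       { return (void *)(h & GC_PTR_MASK); }

Packing the bits into the vTable word this way assumes vTables (and forwarding addresses) are at least 4-byte aligned, which is exactly where the layout issue above bites for objects that carry no vTable pointer at all.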
Which means you lose the 1:1 native rep. So you have to either have a separate base type that can't take an internal reference, or do more fancy marshalling? Both have a high price.

> I wrote yesterday that by-ref is an interior object reference in disguise (at the implementation level). This is incorrect. I was wondering how C# got away with this, and the answer is sneaky:
>
> 1. All unboxed types are value (struct) types
> 2. Within a method of a value type 'vt, /this/ is not an object reference. It has type REF 'vt. Which means that an inner reference cannot leak.
> 3. Structs cannot derive from each other in either direction. The *only* derivation is from System.ValueType. This means that there can only be one vTable for a given struct type, so the vTable pointer isn't needed, even if the struct overrides virtual methods of System.Object.
>
> It's a clever set of rules, but even with these rules they ended up needing to box and unbox the structs a lot.

But that's my point on boxing: it has little to do with not taking an internal reference. You need a box for:

- "Everything is an Object", so value types are cast as objects; V1 did not have generics.
- Casts to interfaces. Probably a bad idea to cast value types as interfaces, and certainly not needed with type classes.
- Some rarer cases.

I'm not convinced by "everything is an object"; it was needed for collections in v1, but again it's not needed with type classes. We can handle internal references to an array via slices.

> We *have* a better term: array. The boxed version of an array should be called a vector. Unfortunately the good people at Microsoft have a limited appreciation of lexicography.

In this case it's more Java's fault: they used Vector for an expanding collection, so if we used Vector it would create confusion for people used to Java when add(element) didn't work. I will try to use:

- Array - unboxed array. (Do we need to distinguish embedded vs allocated separately from the object?)
- Vector - ref array
- List - expandable ref array
- Linked list - node-based list

I think "ref object" and "value object" are good. I think value types embedded in other ref objects should be termed "embedded value object" ("embedded object" for short), but once you can take an internal reference this terminology also becomes muddled.

> Actually, it's not whether the array is boxed or unboxed that's mainly at issue. The main issue is whether the array is dynamically or statically sized. Where we need to disambiguate, perhaps we could use "fixed array" and "dynamic array".

>>> Unfortunately, this is *exactly* the case we need to be concerned about, and the C# solution doesn't help. From the implementation perspective, passing by reference is just syntactic sugar for a [potentially interior] object reference. You know that there is a dominating exterior object reference that is live, so you don't need to *mark* the by-ref references, but in a relocating collector you still need to relocate them. That can be done with fat pointers, I suppose. As you note, one advantage of by-reference is that it is a non-leaking reference, so you don't have to worry about them propagating into the heap.
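For illustration, the fat-pointer / slice representation alluded to here (and the earlier suggestion of handling internal references to arrays via slices) could be as simple as an exterior base reference plus an offset. Names are illustrative only:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical fat pointer for an interior reference: the collector can
     * relocate 'base' as usual and the interior address is recomputed as
     * base + offset, so interior addresses never need to be traced or
     * forwarded directly. */
    typedef struct {
        void  *base;     /* exterior object reference (traced/forwarded by GC) */
        size_t offset;   /* byte offset of the referenced field or element     */
    } interior_ref;

    /* Hypothetical array slice built on the same idea. */
    typedef struct {
        void  *base;     /* exterior reference to the owning array */
        size_t offset;   /* byte offset of the first element       */
        size_t length;   /* number of elements in the slice        */
    } slice;

    static inline void *interior_addr(interior_ref r) {
        return (uint8_t *)r.base + r.offset;
    }

The cost, of course, is that every by-reference value is two (or three) words instead of one.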
>> C# is relocating and I don't think it uses fat pointers. Maybe because gen0 and gen1 collections (which are fast) still stop the world, so they can just scan the stack and find these references (we have the object reference and object size). So you can either go the fat pointer / slice route, or maybe search the stack for these (a bit like your find-the-block-and-work-back, but in this case you search the stackmap).

> Yeah. So I looked, and it turns out that they got this right, and I'm completely wrong about by-reference being an object reference - the /this/ pointer of a value type 'vt is typed as REF 'vt rather than OBREF 'vt. So it can't escape. As long as it can't escape, it's not that big a problem.
>
> If we're trying to avoid the header on interior unboxed objects, we don't need it for finding the base object. Note that objects are a minimum of 4 words (when alignment requirements are considered). On a 32-bit system that's 16 bytes, in which case a 128-byte immix line holds 8 objects. How convenient that there are 8 bits in the per-line metadata byte. Instead of using an object count, we can use a bit to mark which 4-word boundaries correspond to the start of an object. Given such bits, we can walk backwards fairly cheaply to find containing objects.

That seems a better strategy.
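Roughly, the walk-back could look like this, under the assumptions above (16-byte minimum object size, 128-byte lines, one metadata byte per line with one start bit per 16-byte boundary). The metadata layout and names are illustrative:

    #include <stdint.h>

    #define LINE_SIZE       128u                    /* immix line size in bytes          */
    #define GRAIN           16u                     /* minimum object size / start grain */
    #define BLOCK_SIZE      (32u * 1024u)           /* immix block size                  */
    #define LINES_PER_BLOCK (BLOCK_SIZE / LINE_SIZE)

    /* Hypothetical per-block metadata: one byte per line, bit i set iff an
     * object starts at the i-th 16-byte boundary within that line. */
    typedef struct {
        uint8_t start_bits[LINES_PER_BLOCK];
    } block_meta;

    /* Given an interior pointer into the block, walk backwards over the start
     * bits to find the base of the containing object. Assumes p points into a
     * live object whose start bit was set at allocation time. */
    static void *containing_object(block_meta *meta, uint8_t *block_base, void *p)
    {
        uintptr_t off = (uintptr_t)((uint8_t *)p - block_base);
        for (uintptr_t g = off / GRAIN; ; g--) {   /* g: block-relative grain index */
            uintptr_t line = g / (LINE_SIZE / GRAIN);
            uintptr_t bit  = g % (LINE_SIZE / GRAIN);
            if (meta->start_bits[line] & (1u << bit))
                return block_base + g * GRAIN;     /* found the object start */
            if (g == 0)
                return 0;                          /* p was not an interior pointer */
        }
    }

In the common case the start bit is in the same or an adjacent line, so the scan stays short.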
> So now we're considering a model in which maybe only REF 'vt can be an inner reference. Except we still have forwarding to deal with on the exterior object, so we still need a write barrier, and the write barrier in this case needs to be able to find the containing object.

>> I still think at least for v1 embedded unboxed arrays should be ref types, or value types restricted to having no references; it makes things a lot simpler and can be extended later...

>>> Except it doesn't meet the requirements for one of the early things I need to build.

>> Java has no value types at all and is pretty fast; it's not needed, just an optimization.

> It is NOT just an optimization. The absence of general unboxed types is a fundamental expressiveness limitation. You simply cannot write systems code in any sensible way without it. I *challenge* anyone to re-code the Coyotos kernel with comparable efficiency in Java. You get to tell the compiler that no collector is required, so that's a free 25%-30% advantage over conventional Java right there.
>
> It. Can't. Be. Done.
>
> Java lacks the expressiveness to describe the data structures.

But we are not talking about just using it for system structures; as has been noted, C# fixed and referenceless structs are enough for that. We are adding this to increase the expressiveness of the language. Internal references change the concept of these types from value objects to GC-tracked embedded objects (with identities) so that a reference can be taken. They are meant to be cheap and light like native types (and that includes a sub-array), and normally copied by value. Sure, C# allows you to take a reference as an optimization, which may be useful, but it's used by less than 1% of applications. And there is a lot of overhead: it was already pointed out that every header word on 32-bit added 3%, and then you have the other costs (GC tracking, looping the type reference lookup vector, etc.).

>> Which means in a single-threaded context (with the collector on another thread), if it can be identified, you don't need the lock.

> Yes. Read and write locks aren't necessarily GC primitives, but if we end up needing dynamic translation, they should definitely be inlined intrinsics.
>
> Though it should be said that compare-and-swap on a single thread actually isn't that expensive.

In low contention it's cheap, which should be the case here, as the collector rarely runs.

>>> The missing part of this story is that it doesn't address forwarded pointers that get loaded out of objects residing in the heap. If you have to do read locking, you can deal with it then, but otherwise those remain an issue. Note that taking a read lock amounts to a read barrier.

>> So how does it remove the read barrier? Isn't checking the lock (not just taking it) a read barrier?

> Previously, you needed the read barrier *in addition to* the lock. The point here is that if you are locking anyway you can fuse them.
>
> The problem with this approach is that it assumes a per-object lock, which is not how I want things to work.

A lock on the line is nice for the collector but would be expensive for the mutator.

>> It's very complex and I'm struggling to follow it (as I don't have time to read up on all this at the moment), but it seems strange to me that such an expensive mechanism (read barrier / locks) is needed for something that runs so rarely and mainly to assist in defragmentation of the heap...

> Yes. It's a very high cost. The feature that causes it is object relocation - you need a way to make sure that you detect the fact that an object is forwarded.
>
> But now that a lot of the cases are handled in the store barrier, I think that the *rest* can be handled with page faults if we really need to.

+1. That looks promising (even if it does mean lots of page changes to the kernel). I assume you could then just take the lock (page level) in the page fault. And when the collector is not running you're just back to the write barrier.
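For illustration only, the mechanics might look roughly like the following, assuming POSIX mprotect/SIGSEGV: the collector protects the region it is evacuating, a fault on a stale access waits for the collector, and the access retries once the page is released. wait_for_collector_to_release is a hypothetical helper, and the hard part (repairing the stale reference itself) is not shown:

    #include <signal.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Hypothetical: set by the collector while evacuating [from_lo, from_hi). */
    static uint8_t *from_lo, *from_hi;

    /* Hypothetical helper: blocks until the collector has finished forwarding
     * out of the page containing addr and has unprotected it. */
    extern void wait_for_collector_to_release(uint8_t *addr);

    /* Collector side: make the pages being evacuated inaccessible so any
     * mutator access through a stale pointer traps instead of reading a
     * moved object. */
    static void protect_evacuation_region(void *lo, size_t len) {
        mprotect(lo, len, PROT_NONE);
    }

    /* Mutator side: returning from the handler retries the faulting access,
     * which now succeeds because the collector has released the page. */
    static void fault_handler(int sig, siginfo_t *si, void *ctx) {
        uint8_t *addr = (uint8_t *)si->si_addr;
        (void)ctx;
        if (addr >= from_lo && addr < from_hi) {
            wait_for_collector_to_release(addr);
            return;
        }
        signal(sig, SIG_DFL);   /* not ours: let a real segfault proceed */
        raise(sig);
    }

    static void install_fault_handler(void) {
        struct sigaction sa = {0};
        sigemptyset(&sa.sa_mask);
        sa.sa_sigaction = fault_handler;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, 0);
    }

When the collector is idle no pages are protected and the handler is never entered, so the mutator really is back to just the write barrier.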
>> Is the treatment worse than the disease (fragmentation causing extra memory)?

> Unfortunately, the data says it isn't. The problem is that 10% of objects do not die young. That's true in the second generation too. So in a given 32KB block you might have as many as 1K objects, 100 of which don't die very fast. That block can be recycled, but it won't become free any time soon. So what happens is that you end up constantly growing the heap to generate free blocks.

True, but there are limits, e.g. partially filled blocks are reused. It would be nice to get a handle on this, e.g. whether the heap cost of the extra free blocks needed to cover the fragmentation is 20% more (acceptable) or 100% more (not acceptable).

> Or you can size-partition the heap to avoid the fragmentation. But that costs you a 28% increase in L1 cache misses and likely a bunch more in TLB misses.

As per URC. It's a cost, but bearable (based on comparison to MMTk). malloc normally goes to a size-partitioned heap (SLAB/SLUB) unless that has changed again. And you can get something back, as the vTable field can be avoided for many objects. I do think we need to get a handle on the cost of the read barrier / page faults vs the 28% increase in L1 cache misses (+TLB) for non-nursery objects at varying nursery sizes. And it's hard to do that without building it... sigh.

>> Isn't a simple but short stop-the-world almost as effective, with the nursery swept in parallel? Then you only need a write barrier for heap-object references to the nursery, which is pretty cheap. Possibly you can remove most of the nursery pauses by using a write barrier and relying on thread-local nurseries (we discussed some mechanisms).

> It isn't, because the nursery isn't the source of the problem here. Forwarding within the nursery can all be done during nursery sweep (as you suggest). The problem is that for a large heap you need to do concurrent collection in the background, which includes forwarding objects to render low-occupancy blocks free. Thus the read barrier, because the mutator is running concurrently.

Ref counting can take care of freeing most objects, so we are talking about moving only a few objects. A read barrier just feels wrong. Why can't we:

- Identify some blocks with low usage, up to a limit.
- Do a concurrent mark of the main heap, recording objects that refer to the objects in these selected blocks.
- Sync with the nursery so that before the nursery is swept we identify pointers to these objects.
- Use a mod-buf write barrier in between the mark and the nursery sweep. If the mark hasn't run and there is no ref to the nursery then it just returns.
- Lock the world and, in parallel, mark the remainder of the nursery. Complete the stack scan (if doing incremental) and scan the modbuf, adding to the reference list. Move the objects, update references to those objects, sweep the nursery.

Should be a very small pause, maybe only adding 20% to the normal nursery sweep time. The more nursery evacuations, the more objects we free. (A rough sketch of this sequence is below.)
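A rough sketch of the stop-the-world step only (move the selected objects, then back-patch the recorded referring slots). The types and the linear forwarding lookup are illustrative; the concurrent mark, modbuf, and nursery machinery are assumed to exist elsewhere:

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical record of a slot (heap or stack) known, from the concurrent
     * mark plus the modbuf/nursery scan, to point into a selected block. */
    typedef struct { void **slot; } ref_record;

    /* Hypothetical forwarding table entry filled in as new homes are chosen. */
    typedef struct { void *from; void *to; size_t size; } fwd_entry;

    static void evacuate_and_backpatch(fwd_entry *fwd, size_t nfwd,
                                       ref_record *refs, size_t nrefs)
    {
        /* 1. Move the objects (their new homes were allocated beforehand). */
        for (size_t i = 0; i < nfwd; i++)
            memcpy(fwd[i].to, fwd[i].from, fwd[i].size);

        /* 2. Back-patch the recorded references. A linear search stands in
         *    for whatever lookup structure the forwarding table really uses. */
        for (size_t i = 0; i < nrefs; i++) {
            for (size_t j = 0; j < nfwd; j++) {
                if (*refs[i].slot == fwd[j].from) {
                    *refs[i].slot = fwd[j].to;
                    break;
                }
            }
        }
        /* 3. The selected blocks are now free for reuse. */
    }

Checking the reference counts of the moved objects against the number of patched slots would give the confirmation that no in-bound pointer was missed.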
> A compromise would be for the GC to identify blocks that will be evacuated at the start of collection, accumulate a points-to map during tracing, pause all the mutators, forward the objects, and explicitly back-patch the points-to locations. Most of the objects will forward, and the reference counts will usually confirm that we got all of the in-bound pointers. When we didn't, we institute a heavy (concurrent) read barrier with the page fault system. This approach works for a small number of active threads. The problem with it is that it really doesn't scale.

That's similar to what I just wrote. Why does it not scale? You can increase the nursery size and increase concurrency at the cost of longer pauses, or reduce the maximum number of blocks moved for more fragmentation but lower cost. Seems a solid v1 solution, without read barriers or page faults to worry about.

Ben

_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
