I've spent some free brain cycles today thinking about this and here's the scheme I have in mind. If anyone thinks this could be improved in a way that would not have substantial ripple effects throughout the compiler/language (because then it might never actually get implemented) let me know.
1. GC.malloc's signature changes to GC.malloc(size_t size, uint ba = 0, size_t* bitmask = null). A null bitmask means use the old-school behavior and either scan everything or don't scan anything based on the NO_SCAN bit. IMHO plain old conservative scanning must be supported to allow for untyped memory blocks to be allocated, because requiring every memory block to have a type associated with it is an unacceptable limitation in a systems language. For now, the only way to get precise scanning would be to call malloc directly or use a template that does. Eventually new would either be fixed to instantiate the bitMask!(T) template or eliminated entirely. Since using the precise scanning by writing a template that calls GC.malloc() is a lot easier than hacking the compiler, this may be a catalyst for getting rid of new. 2. Some concern has been expressed in the past about the possibility of using 4 bytes per block in overhead to store pointers to bitmasks. IMHO this concern is misplaced because in any program that takes up enough memory for space efficiency to matter, false pointers waste more space. Furthermore, if you're programming for a toaster oven, elevator controller, etc., you probably aren't going to use the standard out-of-the-box GC anyhow. However, I've thought of ways to mitigate this and have come up with the following: A. Store a pointer to the bitmask iff the NO_SCAN bit is set. this is a no-brainer and will prevent any overhead on, for example, arrays of floats. B. Add another attributes bit to the GC called something like NEEDS_BITMASK. This bit would be set iff an object mixes pointers and non-pointers. If it's not set, no bitmask pointer would be stored. However, the overhead of an extra bit may or may not be worth it. 3. The bitmask pointer would be stored at the end of every GC-allocated block for which a bitmask pointer is stored. The reason for using the end of the block instead of the beginning is just implementation simplicity: That way, finding the beginning of a block would work the same whether or not we have a bitmask pointer. 4. The bitmask would be a size_t[] created at compile time by a template and stored in the static data segment. Its layout would be [length of array, T.sizeof, offsets that need to be scanned]. For example, if you have something like: struct Foo { uint bar; void* ptr; } On a 32-bit machine, bitMask!Foo would be [3, 8, 4]. On a 64-bit, it would be [3, 16, 8]. The reason the size of the array is stored in the array is so that we can get away with storing a single ptr in each memory block instead of a pointer and a length. 5. To store information about pinning, we simply use the high-order bits of the pointer offsets. 1 means pinned, 0 means not pinned. This means that, for any type T, T.sizeof can't be bigger than size_t.max / 2. I think this is a fairly minor limitation.