Re: Data Structure Issues

My comments were based not on what you could do, but on what you can do on the CLR/JVM now.

I thought the main reasons for locating the array inside the string were GC related:

- Twice as many GC objects to trace for mark. Right now strings are not checked by the mark, as it knows they are immutable with no references. If the string were just a holder for an array object, that may be more difficult (it could be done, but not on the CLR). Increasing the number of mark checks can be a significant cost in a system that is trying to reduce GC pauses.
- It hijacks the size field of the string object.
- The GC knows nothing about fixed arrays; they can't exist outside of an object.

Huge numbers of small strings like "\n" now become 50 bytes + alignment (2 object headers (32), reference to array (8), 1 char (2), array SyncRoot ref (8)) instead of 18 (object header (16) + 1 char (2)). Now in theory you can build a framework + GC that keeps these objects in a separate area without a header etc., but I'd like to see it built before allowing for it, as GCs are complex and already full of compromises, so something else may stop working.

Re cache: the string cache performance is dwarfed by the creation cost of new strings, which is very common, so you would need some form of slices first. I would have liked slices to be just a reference with some high bits for the length, but that is not possible on the CLR, so you have the cost of 2 objects (the struct and the underlying string). For heavy string work, though, the underlying string becomes less important: because reusing the existing string data reduces the number of strings created, performance will be much better with immutable slices. (GC mark will be heavier, but most of these slices will never escape the nursery.)

I'm not sure large strings are that critical; I just don't see much C# code that works on long strings where the long string is stored for a long time. The long strings I do see are often byte[], e.g. a web page or XML (often, even in C, the XML processor is a COM object), and they are parsed (as bytes!) once to create many short strings (e.g. XML or DOM node content), or packed to byte[] to send to other machines. In nearly all cases the byte[] is UTF-8. Every one of these small strings is converted from UTF-8 after getting it from a native parser.

Now if we had UTF-8 and slices you could do lots of tricks, e.g. in a using block read the native string directly for processing, with no conversion, from the web server or driver, and during parsing just create what you need (DOM nodes etc.). The developer then needs to decide: for short work he builds the DOM nodes and just uses slices; for long work he copies the native array with Buffer.Copy to move it into the GC heap. You have the option either way.

This is especially important for unpacking messages in a higher level network stack. Right now WCF deals with a byte[] that comes in on the wire and creates messages for the user by copying just that message out via a JSON/XML deserializer, which reduces the GC work. If you don't do this, GC pauses become a big problem, as I found when I stored a large number of messages from WCF in queues for 3000 handheld clients.

Ben

Thinking further, I could see this being allowed in the language: e.g. you GCAlloc a large block (just like you will do for regions) and then manage it with unsafe code in the runtime (i.e. not by the GC, and hence no header). If the holder object gets disposed, so does the array it holds. However, consider the huge amount of string code like abc = str1 + str2 + strn..n. At the moment the string goes in the nursery and incurs little memory cost (cmpx nursery pointer, call a method that creates the string header and copies or loads the body data) and no dispose cost. We would then have these strings creating subarrays and removing them in a heap, adding significantly to the cost.

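To make the slice idea above a bit more concrete, here is a rough sketch in Java (the JVM side of the CLR/JVM comparison). It is only an illustration under my own assumptions: Utf8Slice, SliceDemo, splitLines and materialize are hypothetical names, not an existing API, and each slice is still an ordinary heap object with a header, so this is the "2 objects" variant, not the header-free one. It shows parsing a UTF-8 byte[] into slices without copying, and copying out only when a value has to outlive the buffer.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical immutable view over a shared UTF-8 buffer; no bytes are copied. */
final class Utf8Slice {
    private final byte[] data;  // shared backing buffer, never modified here
    private final int offset;   // start of the view, in bytes
    private final int length;   // length of the view, in bytes (not chars)

    Utf8Slice(byte[] data, int offset, int length) {
        if (offset < 0 || length < 0 || offset > data.length - length) {
            throw new IndexOutOfBoundsException("slice out of range");
        }
        this.data = data;
        this.offset = offset;
        this.length = length;
    }

    int length() { return length; }

    /** Sub-slice: a new small view object over the same backing buffer. */
    Utf8Slice slice(int start, int len) {
        if (start < 0 || len < 0 || start > length - len) {
            throw new IndexOutOfBoundsException("sub-slice out of range");
        }
        return new Utf8Slice(data, offset + start, len);
    }

    boolean sharesBufferWith(Utf8Slice other) { return data == other.data; }

    /** "Long work" path: decode and copy out into an ordinary String. */
    String materialize() {
        return new String(data, offset, length, StandardCharsets.UTF_8);
    }
}

final class SliceDemo {
    /** Splits on '\n' (safe in UTF-8: 0x0A never occurs inside a multi-byte sequence). */
    static List<Utf8Slice> splitLines(byte[] buf) {
        List<Utf8Slice> out = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < buf.length; i++) {
            if (buf[i] == '\n') {
                out.add(new Utf8Slice(buf, start, i - start));
                start = i + 1;
            }
        }
        if (start < buf.length) {
            out.add(new Utf8Slice(buf, start, buf.length - start));
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] wire = "GET /index\nHost: example\n".getBytes(StandardCharsets.UTF_8);
        List<Utf8Slice> lines = splitLines(wire);
        System.out.println(lines.size());                            // prints 2
        System.out.println(lines.get(1).slice(0, 4).materialize());  // prints Host
    }
}
```

The trade-off is the one described above: every slice keeps the whole backing buffer reachable, so slices suit short-lived work, and for anything stored long term you would call materialize() (i.e. copy) and let the big buffer go.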
_______________________________________________ bitc-dev mailing list [email protected] http://www.coyotos.org/mailman/listinfo/bitc-dev
