On Wed, Aug 28, 2013 at 6:25 AM, Bennie Kloosteman <[email protected]> wrote:
> ...The fact that 90% of strings are 0x00 0x?? 0x00 0x?? etc. seems
> monumentally wasteful even for foreign languages...

That's an amazingly western-centric view, and it's flatly contradicted by
actual data.

I'm in favor of UTF8 strings, and also of "chunky" strings in which
sub-runs are encoded using the most efficient encoding for that run. Those
are a lot harder to implement correctly than you might believe.

The problem with UTF8 strings is that they do not index efficiently: s[i]
becomes an O(log n) operation rather than an O(1) operation. For
sequential access you can fix that with an iteration helper class, but not
all access is sequential. The same problem exists for strings having mixed
encodings. (See the first sketch at the end of this message.)

> Pretty much 60% of the data moved around or compared for most string
> operations is a huge win over C# and Java. Most web sites are UTF8-ASCII,
> and even foreign web sites are 80-90% ASCII.
> Think middle-tier performance: JSON, XML, etc. Maybe enough to lift Mono
> over those products.

The proportion of in-heap string data has grown since I last saw
comprehensive measurements, and for applications like DOM trees it is a
big part of the total live working set. But data copies are *not* the
dominant performance issue in such applications. Data indexing is. This is
why IBM's ICU library is so important: it reconciles all of the
conflicting definitions of indexing methods and implements the classes
that make the reconciliation possible.

> It would be nice if immutable shallow types were interned in a special
> heap that the mark phase doesn't scan, the way strings are, but I doubt
> that's possible. Also, the above is not possible in safe C# (because of
> the fixed array).

Mark *never* scans strings, so I don't know what you mean here. (See the
second sketch at the end of this message.)
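To make the indexing point concrete, here is a minimal sketch of the two
access patterns in C. It is illustrative only; none of the names come from
any real runtime, and the validation a real decoder needs is elided.

    #include <stddef.h>
    #include <stdint.h>

    /* Decode one code point starting at p. Assumes well-formed UTF-8;
       validation is elided. */
    static uint32_t utf8_decode(const uint8_t *p)
    {
        if (p[0] < 0x80)
            return p[0];
        if ((p[0] & 0xE0) == 0xC0)
            return ((uint32_t)(p[0] & 0x1F) << 6) | (p[1] & 0x3F);
        if ((p[0] & 0xF0) == 0xE0)
            return ((uint32_t)(p[0] & 0x0F) << 12)
                 | ((uint32_t)(p[1] & 0x3F) << 6) | (p[2] & 0x3F);
        return ((uint32_t)(p[0] & 0x07) << 18)
             | ((uint32_t)(p[1] & 0x3F) << 12)
             | ((uint32_t)(p[2] & 0x3F) << 6) | (p[3] & 0x3F);
    }

    /* Naive s[i]: walk from the start of the buffer, skipping one
       variable-length code point per step. This is O(n), which is why
       a flat UTF8 buffer cannot give O(1) indexing. */
    static uint32_t utf8_char_at(const uint8_t *s, size_t len, size_t i)
    {
        size_t byte = 0;
        while (i > 0 && byte < len) {
            byte++;                                /* skip the lead byte */
            while (byte < len && (s[byte] & 0xC0) == 0x80)
                byte++;                            /* skip continuations */
            i--;
        }
        return byte < len ? utf8_decode(s + byte) : 0;
    }

    /* To recover the O(log n) figure, keep a side index recording the
       code point number at the start of each fixed-size chunk. s[i]
       becomes a binary search over the index plus a bounded scan within
       one chunk. */
    struct chunk_index {
        const size_t *first_char; /* code point index at each chunk start */
        const size_t *byte_off;   /* byte offset of each chunk start */
        size_t        nchunks;    /* assumed >= 1 */
    };

    static size_t chunk_for(const struct chunk_index *ix, size_t i)
    {
        size_t lo = 0, hi = ix->nchunks;
        while (hi - lo > 1) {                      /* O(log nchunks) */
            size_t mid = lo + (hi - lo) / 2;
            if (ix->first_char[mid] <= i)
                lo = mid;
            else
                hi = mid;
        }
        return lo;  /* then scan forward from byte_off[lo] as above */
    }

The iteration helper mentioned above is just a cursor that remembers its
current byte offset, so advancing to the next character stays O(1); it is
random access that pays the logarithmic cost.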
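On the mark question, here is a rough sketch of why a mark phase never
looks inside a string. Again this is illustrative; the header layout and
names are mine, not from any particular collector.

    #include <stddef.h>

    /* Illustrative object header for a tracing collector. */
    enum kind { KIND_STRING, KIND_RECORD };

    struct header {
        enum kind kind;
        int       marked;
    };

    static void mark(struct header *obj)
    {
        if (obj == NULL || obj->marked)
            return;
        obj->marked = 1;

        /* A string's payload is pure character data with no interior
           pointers, so the collector marks the header and stops; the
           payload bytes are never scanned. */
        if (obj->kind == KIND_STRING)
            return;

        /* For pointer-bearing objects the collector would trace each
           pointer field here (elided). */
    }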
