On Wed, Sep 11, 2013 at 7:20 AM, Jonathan S. Shapiro <[email protected]> wrote:
> On Mon, Sep 9, 2013 at 11:52 PM, Bennie Kloosteman <[email protected]> wrote:
>
>> But you can't rely on that assumption .. the moment you use the code
>> planes the entire code becomes useless...
>
> Agreed. But there are a surprisingly large number of programs that can
> reasonably refuse to support the extended code planes.

This is equivalent to saying "I can use ASCII in a surprisingly large
number of programs" (all of Europe, North and South America, East Asia,
Oceania)...

> Still, the real answer is that you shouldn't be working in terms of code
> points in any case, and "characters" must necessarily be represented by
> strings if you are following the Way of the Unicode (insert picture of
> white horse ideogram with single twisted horn here).

Agree here. (A small illustration of the characters-as-strings point is
sketched below, after my numbers.)

>> IMHO ... would have been FAR better off staying with ASCII and the
>> encodings...
>
> I have just enough experience with the old encodings to disagree
> violently. But it doesn't matter, because what's done is done.

I meant to add that they were organized. They are still used in East Asia,
and in China it is law that any software sold there must support the GB
encoding...

>> Blowing 40-50% of heap space on 0x00 just leaves a bitter taste in my
>> mouth.
>
> The data I've seen says it isn't that much. Do you have any actual data,
> or is this speculative?

Somewhere in between. You have posted some comments on how much string
data is in heaps, but this comment is about the apps I have built / used
(mainly web / server and some XAML). A lot of apps have 80% of their data
in strings (you can easily see this in .NET with a heap inspector); images
are normally stored outside the heap, and how much numeric data do you
really have?

I have tested and measured web pages in the past, plus source code,
Atom / JSON, and XML in foreign languages, and they are mostly over 90%
ASCII characters (HTML, XML, Javascript, and CSS tags, URLs, etc.; the
actual displayable content of a web page can be pretty small, and the
non-displayable content like URLs and image tags is often English as
well).

Here are some web pages. You can see the ASCII content from the UTF-8 to
UTF-16 size ratio and by looking at the page (a sketch of how to reproduce
the measurement follows below):

http://www.columbia.edu/kermit/utf8.html (mixed languages; I think this is
a really good test case because you get different languages plus English,
which is common): 98% ASCII; UTF-8 49Kb, UTF-16 96Kb.

sohu.com: 92% ASCII (in reality the % is higher, because I converted the
GB encoding to UTF-16 rather than comparing GB against UTF-16); UTF-8
259Kb, UTF-16 477Kb.

http://www.washington.emb.mfa.gov.tr/ (Turkey): 100% ASCII; UTF-8 16Kb,
UTF-16 32Kb.

http://www.jagran.com/ (biggest Indian newspaper, and a bad case for the
web: lots of local content and Hindi): 83% ASCII; UTF-8 294Kb, UTF-16
490Kb.

http://www.aljazeera.net/portal: 91% ASCII; UTF-8 221Kb, UTF-16 404Kb.

We also lose quite a bit to alignment of all data in the heap, which is
something we can't do much about.

The nursery has an even higher percentage of strings, as many are
short-lived, and I include it in my heap. Think about it: when you are
doing lots of string work the nursery is nearly all strings, and half the
bytes are 0x00. You could have packed twice as many strings in before
forcing a GC collect, which means fewer strings being promoted to a higher
generation, etc.
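Here is a rough sketch of the kind of measurement I mean (Python 3,
standard library only; the UTF-8 decode is an assumption, real pages may
need charset detection):

    # Rough sketch: compare UTF-8 vs UTF-16 size for a page and count
    # the 0x00 bytes. Assumes the page body decodes as UTF-8.
    import urllib.request

    def measure(url):
        with urllib.request.urlopen(url) as resp:
            text = resp.read().decode('utf-8', errors='replace')

        utf8 = text.encode('utf-8')
        utf16 = text.encode('utf-16-le')   # no BOM, like an in-memory string

        ascii_pct = 100.0 * sum(c < '\x80' for c in text) / len(text)
        zero_pct = 100.0 * utf16.count(0) / len(utf16)

        print('%s: %.0f%% ASCII, UTF-8 %d bytes, UTF-16 %d bytes '
              '(%.0f%% of the UTF-16 bytes are 0x00)'
              % (url, ascii_pct, len(utf8), len(utf16), zero_pct))

    measure('http://www.columbia.edu/kermit/utf8.html')

On a pure-ASCII page, UTF-16 comes out at exactly twice the UTF-8 size
with half its bytes 0x00, which is the 40-50% of string heap I was
complaining about.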
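And to illustrate the characters-as-strings point from above (again a
rough Python 3 sketch, standard library only):

    # A user-perceived "character" can be several code points, and one
    # code point can be two UTF-16 code units.
    import unicodedata

    composed = '\u00e9'        # e-acute as a single code point
    decomposed = 'e\u0301'     # 'e' plus combining acute: two code points
    print(len(composed), len(decomposed))        # 1 2, same "character"
    print(unicodedata.normalize('NFC', decomposed) == composed)  # True

    clef = '\U0001D11E'        # musical G clef, outside the BMP
    # Two UTF-16 code units (a surrogate pair) for one code point:
    print(len(clef.encode('utf-16-le')) // 2)    # 2

So even if you refuse the extended planes, "one character, one code point"
still breaks down, which is why I agree you have to work with strings.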
Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
