On Wed, Sep 11, 2013 at 7:20 AM, Jonathan S. Shapiro <[email protected]> wrote:

> On Mon, Sep 9, 2013 at 11:52 PM, Bennie Kloosteman <[email protected]> wrote:
>
>> But you can't rely on the assumption .. the moment you use the code planes
>> the entire code becomes useless...
>>
>
> Agreed. But there are a surprisingly large number of programs that can
> reasonably refuse to support the extended code planes.
>
This is equivalent to saying I can use ASCII in a surprisingly large number of
programs (all of Europe, North and South America, East Asia, Oceania) ...


> Still, the real answer is that you shouldn't be working in terms of code
> points in any case, and "characters" must necessarily be represented by
> strings if you are following the Way of the Unicode (insert picture of
> white horse ideogram with single twisted horn here).
>

Agree here.
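
To put a concrete face on that (a minimal sketch in Python, purely
illustrative): a single user-perceived character can span several code
points, so anything that hands out code points one at a time isn't really
handing out "characters".

    # Sketch: one user-perceived "character" can be several code points.
    import unicodedata

    s = "e\u0301"    # 'e' + COMBINING ACUTE ACCENT, displays as one accented e
    print(len(s))    # 2 -- two code points
    print(s[:1])     # 'e' -- slicing by code point splits the grapheme

    nfc = unicodedata.normalize("NFC", s)
    print(len(nfc))  # 1 here, but many graphemes have no single-code-point form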

>> IMHO ... would have been FAR better off staying with ASCII and the
>> encodings...
>>
>
> I have just enough experience with the old encodings to disagree
> violently. But it doesn't matter, because what's done is done.
>

I did mean to add that they were organized.

They are still used in East Asia, and in China it's law that any software here
must support GB encoding ...

>
>
>> Blowing 40-50%  of heap space on 0x00 just leaves a bitter taste in my
>> mouth.
>>
>
> The data I've seen says it isn't that much. Do you have any actual data,
> or is this speculative?
>

Somewhere in between. You have posted some comments on how much string data is
in heaps, but this comment is about the apps I have built / used (mainly web /
server and some XAML). A lot of apps have 80% of their data in strings (you
can easily see this in .NET with a heap inspector); images are normally stored
outside the heap, and how much numeric data do you really have?

I have tested and measured web pages in the past, as well as source code,
Atom/JSON and XML in foreign languages, and they are mostly over 90% ASCII
characters (HTML, XML, JavaScript, CSS tags, URLs, etc.; the actual displayed
content of a web page can be a pretty small fraction, and non-displayable
content like URLs and image tags is often English as well).

Here are some web pages. You can see the ASCII content from the UTF-8 to
UTF-16 size ratio and by looking at the page:

http://www.columbia.edu/kermit/utf8.html (mixed languages; I think this is a
really good test case because you get different languages plus English, which
is common): 98%
  UTF-8:  49 KB
  UTF-16: 96 KB

sohu.com: 92% (note: in reality the % is higher, since I converted the GB
encoding to UTF-16 rather than comparing GB against UTF-16)
  UTF-8:  259 KB
  UTF-16: 477 KB

Turkey, http://www.washington.emb.mfa.gov.tr/: 100%
  UTF-8:  16 KB
  UTF-16: 32 KB

http://www.jagran.com/ (biggest Indian newspaper and a bad case on the web:
lots of local content and Hindi): 83%
  UTF-8:  294 KB
  UTF-16: 490 KB

http://www.aljazeera.net/portal: 91%
  UTF-8:  221 KB
  UTF-16: 404 KB
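
For what it's worth, here is a rough sketch of how such a comparison can be
reproduced (Python, illustrative only; not the exact procedure I used for the
numbers above, and it assumes the page decodes as UTF-8):

    # Sketch: compare UTF-8 vs UTF-16 size of a page and its ASCII fraction.
    import urllib.request

    url = "http://www.columbia.edu/kermit/utf8.html"   # any page of interest

    with urllib.request.urlopen(url) as resp:
        raw = resp.read()

    # Assumes UTF-8; a page served as GB2312/GBK etc. would need to be
    # decoded with its declared charset first.
    text = raw.decode("utf-8", errors="replace")

    utf8_bytes  = len(text.encode("utf-8"))
    utf16_bytes = len(text.encode("utf-16-le"))   # -le avoids counting a BOM
    ascii_chars = sum(1 for ch in text if ord(ch) < 128)

    print("UTF-8 :", utf8_bytes // 1024, "KB")
    print("UTF-16:", utf16_bytes // 1024, "KB")
    print("ASCII :", round(100 * ascii_chars / len(text)), "%")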


We also lose quite a bit because of alignment padding on all data in the heap,
which is something we can't do much about.

The nursery has an even higher % of strings, since many strings are
short-lived, and I include the nursery in my heap figures. (Think about when
you're doing lots of string work: the nursery is nearly all strings, and half
the bytes are 0x00. You could have packed twice as many strings in before
forcing a GC collection, which means fewer strings being promoted to a higher
generation, etc.)
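
As a tiny illustration of the 0x00 point (sketch only):

    # Sketch: ASCII text stored as UTF-16 is half zero bytes.
    s = "the quick brown fox"
    enc = s.encode("utf-16-le")
    print(len(s), len(enc))           # 19 characters -> 38 bytes
    print(enc.count(0) / len(enc))    # 0.5 -- every other byte is 0x00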

Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
