On Thu, Aug 29, 2013 at 3:25 AM, Jonathan S. Shapiro <[email protected]>wrote:

> On Wed, Aug 28, 2013 at 6:25 AM, Bennie Kloosteman <[email protected]>wrote:
>
>> ...The fact that 90% of strings are 0x00 0x??  0x00 0x?? etc seems
>> monumentally wasteful even for foreign languages ..
>>
>
> That's an amazingly western-centric view, and it's flatly contradicted by
> actual data.
>

I live in China ... and posted the data before. Look at how much
JavaScript, PostScript, and HTML keyword content there is in pages, and
even XML and JSON type names, messages, etc. I pulled down 20 foreign web
sites and they were nearly all over 80% ASCII, because of the influence of
English on software.

Re the western view: do you know most Chinese strings are 1/3 the size
because they make use of the full 16 bits anyway? And Unicode is not even
official here; officially you should use ASCII and then an encoding scheme
such as GB or GBK. (Unicode can't represent some newer characters, so they
layer this encoding on top of Unicode anyway and suffer a double whammy
because the encoded characters are wider.)

>
> I'm in favor of UTF8 strings, and also of "chunky" strings in which
> sub-runs are encoded using the most efficient encoding for the run. Those
> are a lot harder to implement correctly than you might believe.
>

I know; we discussed most of the issues earlier in BitC. I even had a look
at converting Mono to UTF-8 in the string implementation, but there were
too many native and unsafe hooks, which made it too hard.

>
> The problem with UTF8 strings is that they do not index efficiently. s[i]
> becomes an O(log n) operation rather than an O(1) operation. For sequential
> access you can fix that with an iteration helper class, but not all access
> is sequential. The same problem exists for strings having mixed formats.
>

If you know it's ASCII it indexes fast. Most strings are very small, and
SIMD can scan 32 bytes at a time for the high bit (non-ASCII lead bytes),
so you can quickly build an offset index. I bet nearly all indexing is on
English chars anyway ... You may say long strings, but nearly all long
strings are UTF-8!

>
>
>> Pretty much 60% of the data moved around or compared for most string
>> operations is a huge win over C# and Java  . Most web sites are UTF8-ASCII
>> and even foreign web sites are 80-90% ASCII .
>> Think middle tier performance json , xml  etc etc , Maybe enough to lift
>> mono over those products.
>>
>
> The proportion of in-heap string data has grown since I last saw
> comprehensive measurements, and for applications like DOM trees it's a big
> part of the total live working set. But data copies are *not* the
> dominant issue in performance in such applications. Data indexing is. This
> is why IBM's ICU library is so important. It reconciles all of the
> conflicting definitions of indexing methods and implements the classes that
> make the reconciliation possible.
>
>
1. Reducing heap size by 35% does affect performance; if the app is paging
it helps a lot, and you also improve cache performance. On handhelds the
memory saving allows better algorithms to be used for the rest of the app.

2. You can't index Asian characters on Unicode anyway, as there is not a
1:1 Unicode-to-character relationship (see the encoding above); they
nearly always use additional libraries. So UTF-16 doesn't help most Asian
languages, since they build custom encodings on top of Unicode, and you
hurt western European performance, including the English characters
embedded in everyone's language. You benefit letter-based (not
character-based) non-European languages, provided the English content is
not too high.

3. If you're doing really intensive mutable indexing, you're not working
with C# immutable strings; you're likely working with char arrays and
tree-based representations. Re DOM trees: the Rust guys know a lot about
DOM trees, since they are the Firefox team, and they use UTF-8 for
strings in Rust (and UCS-4 for chars).

4. To do the indexing, in most cases you're doing an O(n) scan anyway, so
there is no difference. E.g. find the first "<node>", then find the next
"</node>", then subtract; that is identical between UTF-8 and UTF-16.
Performance is only significantly different if you then go get the 1000th
character after the index of the node. Also, the string can set a bit
after a complete scan (or if indicated on construction) marking it as
ASCII and eliminating the escape check.


Ben
_______________________________________________
bitc-dev mailing list
[email protected]
http://www.coyotos.org/mailman/listinfo/bitc-dev
