Doug Ewell wrote:
> BOCU-1 might solve this problem, but multiplying and dividing by 243
> doesn't sound faster than UTF-8 bit-shifting.  (I'm still amazed by the
> claim in UTN #6 that converting Hindi text between UTF-16 and BOCU-1
> took only 45% as long as converting it between UTF-16 and UTF-8.)

"claim"? That hurts...


I did measure these things, and the numbers in the table are all from my measurements. I also included the type of machine I used, etc. (http://www.unicode.org/notes/tn6/#Performance)

The reason BOCU-1 (and SCSU) is often faster than UTF-8 is that BOCU-1 goes into single-byte mode for small scripts like the Devanagari used for Hindi. Single-byte mode performs only a subtraction, no div/mod or even bit-shifting, and writes/reads only one byte per character. It is also optimized in ICU with a tight inner loop.
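
For illustration, here is a minimal sketch of that single-byte fast path. The names and simplifications are mine, not the real ICU code, and it omits the full algorithm's special cases (C0 controls, space, the multi-byte trail ranges, and the Hiragana/CJK prev adjustments); see UTN #6 for the real thing.

#include <stdint.h>

/* Align "prev" to the middle of the 128-code-point block of the
 * previous character, so consecutive characters from one small
 * script produce small deltas. (The real encoder special-cases
 * Hiragana, CJK, etc.) */
static int32_t adjust_prev(int32_t c) {
    return (c & ~0x7F) + 0x40;
}

/* Encode one character relative to *prev; returns the number of
 * bytes written, or -1 when the delta is too large for a single
 * byte and would need the 2..4-byte forms (the part that divides
 * by 243, not shown). */
static int bocu1_encode_single(int32_t c, int32_t *prev, uint8_t *out) {
    int32_t diff = c - *prev;          /* the only arithmetic: a subtraction */
    *prev = adjust_prev(c);
    if (-64 <= diff && diff <= 63) {
        *out = (uint8_t)(0x90 + diff); /* single-byte deltas map to 0x50..0xCF */
        return 1;
    }
    return -1;
}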

UTF-8, on the other hand, encodes Devanagari with 3 bytes per character and has to perform the bit-shifting and write to/read from more memory locations.
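
For comparison, here is a minimal UTF-8 encoder for BMP code points (my own illustration, with no surrogate or error handling); a Devanagari character such as U+0915 takes the 3-byte branch, with its shifts, masks, and three memory writes:

#include <stdint.h>

/* Encode one BMP code point as UTF-8; returns bytes written.
 * Illustration only: no surrogate or error handling. */
static int utf8_encode_bmp(uint16_t c, uint8_t *out) {
    if (c < 0x80) {                          /* ASCII: 1 byte */
        out[0] = (uint8_t)c;
        return 1;
    }
    if (c < 0x800) {                         /* Greek/Cyrillic/Arabic etc.: 2 bytes */
        out[0] = (uint8_t)(0xC0 | (c >> 6));
        out[1] = (uint8_t)(0x80 | (c & 0x3F));
        return 2;
    }
    /* Devanagari (U+0900..U+097F) and most other BMP scripts: 3 bytes */
    out[0] = (uint8_t)(0xE0 | (c >> 12));
    out[1] = (uint8_t)(0x80 | ((c >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (c & 0x3F));
    return 3;
}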

It's the same for Greek/Cyrillic/Arabic etc., although to a lesser degree, because there it's single bytes with BOCU-1 vs. only 2 bytes per character with UTF-8.

The fact that BOCU-1 achieves not only good compression (plus binary order and MIME text/* compatibility) but also reasonable conversion performance encouraged Mark and me to publish it.

UTF-8 is useful because it's simple and supported just about everywhere - but it's otherwise hardly optimal for anything.

If you want a high-speed, compact encoding, use SCSU. If you want good speed, compact encoding, and binary order and/or MIME compatibility, use BOCU-1. Make sure that both sides of the wire know what's going across.
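
With ICU, picking one of these on the wire is just a converter name. A minimal sketch, assuming ICU is installed ("BOCU-1" and "SCSU" are the names ICU registers for these converters):

#include <stdio.h>
#include <unicode/ucnv.h>

int main(void) {
    UErrorCode status = U_ZERO_ERROR;
    /* Open a BOCU-1 converter by name; pass "SCSU" for SCSU instead. */
    UConverter *cnv = ucnv_open("BOCU-1", &status);
    if (U_FAILURE(status)) return 1;

    /* "Namaste" in Devanagari: six UTF-16 code units. */
    static const UChar hindi[] = { 0x0928, 0x092E, 0x0938, 0x094D, 0x0924, 0x0947 };
    char bytes[64];
    int32_t length = ucnv_fromUChars(cnv, bytes, (int32_t)sizeof(bytes),
                                     hindi, (int32_t)(sizeof(hindi)/sizeof(hindi[0])),
                                     &status);
    if (U_SUCCESS(status))
        printf("BOCU-1: %d bytes for 6 UTF-16 code units\n", (int)length);
    ucnv_close(cnv);
    return 0;
}

(Compile with -licuuc.) The receiving side opens the same converter and uses ucnv_toUChars() to get UTF-16 back - which is what "both sides of the wire know what's going across" means in practice.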

markus


