Re: Nicest UTF

Doug Ewell Sun, 05 Dec 2004 20:30:57 -0800

Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

>> Here is a string, expressed as a sequence of bytes in SCSU:
>>
>> 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
>>       M   o  s  s  o  v       SP  i  s SP                      .
>
> Without looking at it, it's easy to see that this stream is separated
> in three sections, initiated by 05 1C, then 05 1D, then 12. I can't
> remember without looking at the UTN what they perform (i.e. which
> Unicode code points range they select), but the other bytes are simple
> offsets relative to the start of the selected ranges. Also the third
> section is ended by a regular dot (2E) in the ASCII range selected for
> the low half-page, and the other bytes are offsets for the script
> block initiated by 12.


05 is a static-quote tag which modifies only the next byte.  It doesn't
really initiate a new section; it's intended for isolated characters
where initiating a new section would be wasteful.  The sequences <05 1C>
and <05 1D> encode the matching double-quote characters U+201C and
U+201D respectively.

12 switches to a new dynamic window -- in this case, window 2, which is
predefined to point to the Cyrillic block -- so it does select a range
as you said.  Also, the ASCII bytes do represent Basic Latin characters.

> Immediately I can identify this string, without looking at any table:
>
> "Mossov?" is ??????.
>
> where each ? replaces a character that I can't decipher only through
> my defective memory. (I don't need to remember the details of the
> standard table of ranges, because I know that this table is complete
> in a small and easily available document).

Actually "Moscow," not "Mossov" -- but as you said, this is not
important because a computer would have gotten this arithmetic right.
The actual string is:

“Moscow” is Москва.
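
The tag handling described above can be sketched in a few lines of
Python. This is only a minimal illustration, not a full decoder: it
assumes single-byte mode and handles just the SQn quote tags (01-08),
the SCn window-change tags (10-17) and literal bytes. The function name
decode_scsu_subset is made up for this example; the window offsets are
the defaults listed in UTS #6.

```python
# Default window offsets as listed in UTS #6 (SCSU).
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300,
                  0x2000, 0x2080, 0x2100, 0x3000]
DYNAMIC_DEFAULTS = [0x0080, 0x00C0, 0x0400, 0x0600,
                    0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode single-byte-mode SCSU that uses only SQn, SCn and literals.

    Hypothetical helper for illustration; any other tag raises ValueError.
    """
    windows = list(DYNAMIC_DEFAULTS)  # dynamic windows, never redefined here
    active = 0                        # dynamic window 0 is active at start
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if 0x01 <= b <= 0x08:
            # SQn: quote exactly one byte from window n, then carry on
            # with the active window unchanged.
            n = b - 0x01
            if i >= len(data):
                raise ValueError("truncated input after SQn tag")
            q = data[i]
            i += 1
            if q < 0x80:
                out.append(chr(STATIC_WINDOWS[n] + q))
            else:
                out.append(chr(windows[n] + (q - 0x80)))
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active one.
            active = b - 0x10
        elif b >= 0x80:
            # High byte: offset into the active dynamic window.
            out.append(chr(windows[active] + (b - 0x80)))
        elif b in (0x00, 0x09, 0x0A, 0x0D) or b >= 0x20:
            # Pass-through controls and Basic Latin.
            out.append(chr(b))
        else:
            raise ValueError(f"unsupported SCSU tag 0x{b:02X}")
    return "".join(out)


sample = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)
print(decode_scsu_subset(sample))
```

Here <05 1C> resolves to static window 4 (U+2000) plus 1C, and each
byte after 12 resolves to U+0400 plus the byte value minus 80 hex.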

> The decoder part of SCSU still remains extremely trivial to implement,
> given the small but complete list of codes that can alter the state of
> the decoder, because there's no choice in its interpretation and
> because the set of variables to store the decoder state is very
> limited, as well as the number of decision tests at each step. This is
> a "finite state automata".

I think "extremely trivial" is overstating the case a bit.  It is
straightforward and not very difficult, but still somewhat more complex
than a UTF.  (There had better not be any choice in interpretation, if
we want lossless decompression!)

BTW, the singular is "automaton."

> Only the encoder may be a bit complex to write (if one wants to
> generate the optimal smallest result size), but even a moderate
> programmer could find a simple and working scheme with a still
> excellent compression rate (around 1 to 1.2 bytes per character on
> average for any Latin text, and around 1.2 to 1.5 bytes per character
> for Asian texts which would still be a good application of SCSU face
> to UTF-32 or even UTF-8).

UTN #14 contains pseudocode for an encoder that beats the Japanese
example in UTS #6 (by one byte, big deal) and can be easily translated
into working code.
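
For a rough sense of the bytes-per-character figures quoted above, the
short "Moscow" example from earlier in this message can be measured
against the UTFs. This is just an illustrative count on one tiny
string, not a benchmark; the variable names are made up for the sketch,
and the BOM-less big-endian codecs are used so that no byte order mark
inflates the UTF-16 and UTF-32 sizes.

```python
# The example string, written with escapes to keep the source ASCII-only.
text = "\u201cMoscow\u201d is \u041c\u043e\u0441\u043a\u0432\u0430."

# The SCSU byte sequence given earlier in this thread.
scsu = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)

sizes = {
    "SCSU": len(scsu),
    "UTF-8": len(text.encode("utf-8")),
    "UTF-16": len(text.encode("utf-16-be")),
    "UTF-32": len(text.encode("utf-32-be")),
}

for name, size in sizes.items():
    print(f"{name:6} {size:3d} bytes  {size / len(text):.2f} bytes/char")
```

For this string of 19 characters, the SCSU form takes 22 bytes against
29 for UTF-8, 38 for UTF-16 and 76 for UTF-32.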

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/
