Philippe Verdy <verdy underscore p at wanadoo dot fr> wrote:

>> Here is a string, expressed as a sequence of bytes in SCSU:
>>
>> 05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E
>> M o s s o v SP i s SP .
>
> Without looking at it, it's easy to see that this stream is separated
> in three sections, initiated by 05 1C, then 05 1D, then 12. I can't
> remember without looking at the UTN what they perform (i.e. which
> Unicode code points range they select), but the other bytes are simple
> offsets relative to the start of the selected ranges. Also the third
> section is ended by a regular dot (2E) in the ASCII range selected for
> the low half-page, and the other bytes are offsets for the script
> block initiated by 12.

05 is a static-quote tag which modifies only the next byte. It doesn't
really initiate a new section; it's intended for isolated characters
where initiating a new section would be wasteful. The sequences <05 1C>
and <05 1D> encode the matching double-quote characters U+201C and
U+201D respectively.

12 switches to a new dynamic window -- in this case, window 2, which is
predefined to point to the Cyrillic block -- so it does select a range
as you said. Also, the ASCII bytes do represent Basic Latin characters.

> Immediately I can identify this string, without looking at any table:
>
> "Mossov?" is ??????.
>
> where each ? replaces a character that I can't decipher only through
> my defective memory. (I don't need to remember the details of the
> standard table of ranges, because I know that this table is complete
> in a small and easily available document).

Actually "Moscow," not "Mossov" -- but as you said, this is not
important because a computer would have gotten this arithmetic right.
The actual string is:

“Moscow” is Москва.

> The decoder part of SCSU still remains extremely trivial to implement,
> given the small but complete list of codes that can alter the state of
> the decoder, because there's no choice in its interpretation and
> because the set of variables to store the decoder state is very
> limited, as well as the number of decision tests at each step. This is
> a "finite state automata".

I think "extremely trivial" is overstating the case a bit. It is
straightforward and not very difficult, but still somewhat more complex
than a UTF. (There had better not be any choice in interpretation, if
we want lossless decompression!)

BTW, the singular is "automaton."
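
To make the arithmetic above concrete, here is a minimal Python sketch
of a decoder for just the tags that occur in this example: SQn (quote
one character, bytes 01-08) and SCn (change dynamic window, bytes
10-17), in single-byte mode only. The function name and the decision to
reject every other tag are assumptions made for illustration; the
window offsets are the initial values given in UTS #6. It is nowhere
near a complete SCSU decoder.

```python
# Sketch only: handles single-byte mode with SQn and SCn tags, nothing else.
# Static window offsets and initial dynamic window offsets as listed in UTS #6.
STATIC_WINDOWS = [0x0000, 0x0080, 0x0100, 0x0300,
                  0x2000, 0x2080, 0x2100, 0x3000]
DEFAULT_DYNAMIC_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600,
                           0x0900, 0x3040, 0x30A0, 0xFF00]


def decode_scsu_subset(data: bytes) -> str:
    """Decode SCSU bytes that use only SQn, SCn, and literal bytes.

    Hypothetical helper for illustration, not a conformant decoder.
    """
    dynamic = list(DEFAULT_DYNAMIC_WINDOWS)  # never redefined in this sketch
    active = 0                               # dynamic window 0 is active at start
    out = []
    i = 0
    while i < len(data):
        b = data[i]
        i += 1
        if 0x01 <= b <= 0x08:
            # SQn: quote exactly one character from window n.
            n = b - 0x01
            if i >= len(data):
                raise ValueError("SQn tag at end of input")
            q = data[i]
            i += 1
            if q < 0x80:
                out.append(chr(STATIC_WINDOWS[n] + q))      # static window
            else:
                out.append(chr(dynamic[n] + (q - 0x80)))    # dynamic window
        elif 0x10 <= b <= 0x17:
            # SCn: make dynamic window n the active one.
            active = b - 0x10
        elif b >= 0x80:
            # High half: offset into the active dynamic window.
            out.append(chr(dynamic[active] + (b - 0x80)))
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            # Low half: passed through unchanged.
            out.append(chr(b))
        else:
            raise NotImplementedError(f"tag 0x{b:02X} is outside this sketch")
    return "".join(out)


SAMPLE = bytes.fromhex(
    "05 1C 4D 6F 73 63 6F 77 05 1D 20 69 73 20 12 9C BE C1 BA B2 B0 2E"
)
decoded = decode_scsu_subset(SAMPLE)
```

With the sample bytes, 05 1C gives 0x2000 + 0x1C = U+201C, and after 12
the byte 9C gives 0x0400 + (0x9C - 0x80) = U+041C, so `decoded` is the
string shown above.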

> Only the encoder may be a bit complex to write (if one wants to
> generate the optimal smallest result size), but even a moderate
> programmer could find a simple and working scheme with a still
> excellent compression rate (around 1 to 1.2 bytes per character on
> average for any Latin text, and around 1.2 to 1.5 bytes per character
> for Asian texts which would still be a good application of SCSU face
> to UTF-32 or even UTF-8).

UTN #14 contains pseudocode for an encoder that beats the Japanese
example in UTS #6 (by one byte, big deal) and can be easily translated
into working code.

-Doug Ewell
 Fullerton, California
 http://users.adelphia.net/~dewell/