In a message dated 2002-01-21 5:20:55 Pacific Standard Time, [EMAIL PROTECTED] writes:
> Doug Ewell wrote:
>> Devanagari text encoded in SCSU occupies exactly 1 byte per
>> character, plus an additional byte near the start of the
>> file to set the current window (0x14 = SC4).
>
> The problem is what happens if that very byte gets corrupted for any
> reason...
>
> If an octet is erroneously deleted, changed or added from an UTF-8 stream,
> only a single character would be corrupted. If the same thing happens to the
> window-setting byte of a SCSU (or other similar "zany" formats), the whole
> stream turns into garbage.

Yes, SCSU is stateful and the corruption of a single tag, or argument to a
tag, could potentially damage large amounts of text. I know this was a big
problem in the days of devices and transmission protocols that did little or
no error correction. I honestly don't know how big a problem it is today.

> What this means in practice for website developers is:
>
> 1) SCSU text can only be edited with a text editor which properly decodes
> the *whole* file on load and re-encodes it on save. On the other hand, UTF-8
> text can also be edited using an encoding-unaware editor, although non-ASCII
> text is invisible.

I have edited SCSU text using a completely encoding-ignorant MS-DOS editor.
Of course I couldn't edit the SCSU control bytes intelligently, but then I
can't edit multibyte UTF-8 sequences intelligently with it either.

> 2) SCSU text cannot be built by assembling binary pieces coming from
> external sources. E.g., you cannot get a SCSU-encoded template file and fill
> in the blanks with customer data coming from a SCSU-encoded database: each
> time you insert a piece of text coming from the database, you delete the
> current window information, turning into garbage the rest of the file.

The current window information is not deleted, it is carried over into any
adjoining text that does not redefine it. (This could have its own
repercussions, of course.)

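The two behaviours discussed so far (a lost window-setting byte, and window
state carrying over into adjoining text) can be illustrated with a small
sketch. This is not a real SCSU codec: the function names toy_scsu_encode and
toy_scsu_decode are hypothetical, and the code assumes only single-byte mode,
the SC0..SC7 tags (0x10..0x17) and the default dynamic window offsets. Every
other SCSU tag is rejected.

```python
# Toy subset of SCSU single-byte mode (an illustration, not a full codec).
# Assumed: default dynamic window offsets, and only the SCn tags 0x10..0x17.
DEFAULT_WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}


def toy_scsu_decode(data: bytes) -> str:
    """Decode the toy subset; the active window starts at 0 and persists."""
    window = 0
    out = []
    for b in data:
        if 0x10 <= b <= 0x17:            # SCn: select dynamic window n
            window = b - 0x10
        elif b >= 0x80:                  # character in the active window
            out.append(chr(DEFAULT_WINDOWS[window] + (b - 0x80)))
        elif b >= 0x20 or b in PASS_THROUGH:
            out.append(chr(b))           # ASCII passes through unchanged
        else:
            raise NotImplementedError(f"SCSU tag 0x{b:02X} not handled in this sketch")
    return "".join(out)


def toy_scsu_encode(text: str) -> bytes:
    """Encode ASCII plus Devanagari (U+0900..U+097F): SC4, then 1 byte per character."""
    out = bytearray([0x14])              # 0x14 = SC4, window 4 starts at U+0900
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:
            out.append(0x80 + (cp - 0x0900))
        elif 0x20 <= cp <= 0x7F or cp in PASS_THROUGH:
            out.append(cp)
        else:
            raise ValueError(f"U+{cp:04X} is outside this sketch's range")
    return bytes(out)


text = "\u0928\u092e\u0938\u094d\u0924\u0947"   # six Devanagari code points
scsu = toy_scsu_encode(text)                     # 1 window byte + 6 data bytes
utf8 = text.encode("utf-8")                      # 3 bytes per code point

# Losing the first byte: every SCSU character falls back to window 0 (Latin-1),
# while UTF-8 loses one character and resynchronizes.
scsu_damaged = toy_scsu_decode(scsu[1:])
utf8_damaged = utf8[1:].decode("utf-8", errors="replace")

# Concatenation: a piece encoded on the assumption that window 0 is active
# inherits window 4 from the text before it, so 0xE9 is no longer U+00E9.
latin_piece = b"caf\xe9"
joined = toy_scsu_decode(scsu + latin_piece)
```

Here `toy_scsu_decode(latin_piece)` on its own yields "caf" followed by
U+00E9, but after the Devanagari piece the same byte decodes to U+0969. A
piece that began with its own SC0 tag would not be affected.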
> 3) A SCSU page can only be accepted by browsers and e-mail readers that are
> able to decode it. On the other hand, UTF-8 also works on old ASCII-based
> browsers, although non-ASCII text is clearly not properly displayed.

Same as 1). If you have only ASCII text, SCSU == UTF-8 == ASCII, and if you
have non-ASCII text, both SCSU and UTF-8 encode that text with byte sequences
that readers must know how to decode.

SCSU does use states, like any compression scheme, so an encoding-ignorant
tool will probably have more trouble with SCSU than with UTF-8. But I was not
arguing to foist SCSU on an unprepared world, I was suggesting that the world
should prepare. ☺

-Doug Ewell
 Fullerton, California