In a message dated 2002-01-21 5:20:55 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

> Doug Ewell wrote:
>> Devanagari text encoded in SCSU occupies exactly 1 byte per
>> character, plus an additional byte near the start of the
>> file to set the current window (0x14 = SC4).
>
> The problem is what happens if that very byte gets corrupted for any
> reason...
>
> If an octet is erroneously deleted, changed or added from an UTF-8 stream,
> only a single character would be corrupted. If the same thing happens to the
> window-setting byte of a SCSU (or other similar "zany" formats), the whole
> stream turns into garbage.

Yes, SCSU is stateful and the corruption of a single tag, or argument to a 
tag, could potentially damage large amounts of text.  I know this was a big 
problem in the days of devices and transmission protocols that did little or 
no error correction.  I honestly don't know how big a problem it is today.
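
To make the trade-off concrete, here is a rough Python sketch. It is not a 
conforming SCSU implementation: scsu_decode_subset is a made-up name and it 
handles only ASCII pass-through, the SC0..SC7 tags, and single bytes in the 
active dynamic window. The window offsets are the defaults from the SCSU 
specification; the sample word and the choice of which byte to corrupt are 
arbitrary assumptions.

```python
# Default dynamic window offsets from the SCSU specification (UTS #6).
# Window 4 starts at U+0900, the Devanagari block.
WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SC0 = 0x10  # SC0..SC7 are 0x10..0x17: "select dynamic window n"


def scsu_decode_subset(data: bytes) -> str:
    """Decode a tiny SCSU subset (hypothetical helper, illustration only)."""
    active = 0  # SCSU starts with dynamic window 0 selected
    out = []
    for b in data:
        if b >= 0x80:
            # High bytes index into whichever window is currently active.
            out.append(chr(WINDOWS[active] + (b - 0x80)))
        elif SC0 <= b <= SC0 + 7:
            active = b - SC0  # this is the state a corrupted tag destroys
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))  # ASCII passes through unchanged
        else:
            raise ValueError(f"tag 0x{b:02X} is outside this sketch's subset")
    return "".join(out)


# Sample Devanagari word, 6 code points, all inside U+0900..U+097F.
text = "\u0928\u092e\u0938\u094d\u0924\u0947"
scsu = bytes([0x14]) + bytes(0x80 + ord(c) - 0x0900 for c in text)
utf8 = text.encode("utf-8")
print(len(scsu), len(utf8))  # 7 18

# Corrupt the window-setting byte: SC4 (0x14) becomes SC2 (0x12).
bad_scsu = bytes([0x12]) + scsu[1:]
scsu_result = scsu_decode_subset(bad_scsu)
print(ascii(scsu_result))  # every character now lands in U+0400..U+047F

# Corrupt the first byte of the UTF-8 form instead.
bad_utf8 = b"A" + utf8[1:]
utf8_result = bad_utf8.decode("utf-8", errors="replace")
print(ascii(utf8_result))  # first character lost, the other five intact
```

In this one case the SCSU damage reaches every character after the bad tag, 
while the UTF-8 damage stays within the one character whose lead byte was hit.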

> What this means in practice for website developers is:
>
> 1) SCSU text can only be edited with a text editor which properly decodes
> the *whole* file on load and re-encodes it on save. On the other hand, UTF-8
> text can also be edited using an encoding-unaware editor, although non-ASCII
> text is invisible.

I have edited SCSU text using a completely encoding-ignorant MS-DOS editor.  
Of course I couldn't edit the SCSU control bytes intelligently, but then I 
can't edit multibyte UTF-8 sequences intelligently with it either.

> 2) SCSU text cannot be built by assembling binary pieces coming from
> external sources. E.g., you cannot get a SCSU-encoded template file and fill
> in the blanks with customer data coming from a SCSU-encoded database: each
> time you insert a piece of text coming from the database, you delete the
> current window information, turning into garbage the rest of the file.

The current window information is not deleted; it is carried over into any 
adjoining text that does not redefine it.  (This could have its own 
repercussions, of course.)
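
A small Python sketch of that carry-over, under heavy assumptions: the 
decoder is a hypothetical helper covering only ASCII, the SC0..SC7 tags, and 
single window bytes, and the template and database strings are invented 
samples.  It is meant only to show where the window state goes when pieces 
are spliced, not how a real SCSU tool behaves.

```python
# Default dynamic window offsets from the SCSU specification (UTS #6).
WINDOWS = [0x0080, 0x00C0, 0x0400, 0x0600, 0x0900, 0x3040, 0x30A0, 0xFF00]
SC2, SC4 = 0x12, 0x14  # select window 2 (Cyrillic) / window 4 (Devanagari)


def scsu_decode_subset(data: bytes) -> str:
    """Decode a tiny SCSU subset (hypothetical helper, illustration only)."""
    active = 0
    out = []
    for b in data:
        if b >= 0x80:
            out.append(chr(WINDOWS[active] + (b - 0x80)))
        elif 0x10 <= b <= 0x17:
            active = b - 0x10
        elif b >= 0x20 or b in (0x00, 0x09, 0x0A, 0x0D):
            out.append(chr(b))
        else:
            raise ValueError(f"tag 0x{b:02X} is outside this sketch's subset")
    return "".join(out)


def window_bytes(text: str, window: int) -> bytes:
    """Encode text lying entirely inside one window, without any tag."""
    base = WINDOWS[window]
    return bytes(0x80 + ord(c) - base for c in text)


# Invented sample strings: a Devanagari template and a Cyrillic field.
head_text = "\u0928\u093e\u092e"        # template text before the blank
tail_text = "\u091c\u0940"              # template text after the blank
name_text = "\u0418\u0432\u0430\u043d"  # value coming from the database

head = bytes([SC4]) + window_bytes(head_text, 4) + b": "
tail = b" " + window_bytes(tail_text, 4)          # assumes window 4 is active
name = bytes([SC2]) + window_bytes(name_text, 2)  # sets its own window

# Naive splice: the field's window 2 carries over into the template's tail.
naive = scsu_decode_subset(head + name + tail)
# Splice that redefines the window after the inserted field.
fixed = scsu_decode_subset(head + name + bytes([SC4]) + tail)

print(ascii(naive))  # tail decoded as Cyrillic: the "repercussion"
print(ascii(fixed))  # tail decoded as Devanagari, as intended
```

Only the text between the splice and the next window tag is misread, and one 
extra tag byte after the inserted field is enough to restore it in this sketch.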

> 3) A SCSU page can only be accepted by browsers and e-mail readers that are
> able to decode it. On the other hand, UTF-8 also works on old ASCII-based
> browsers, although non-ASCII text is clearly not properly displayed.

Same as 1).  If you have only ASCII text, SCSU == UTF-8 == ASCII, and if you 
have non-ASCII text, both SCSU and UTF-8 encode that text with byte sequences 
that readers must know how to decode.  SCSU does use states, like any 
compression scheme, so an encoding-ignorant tool will probably have more 
trouble with SCSU than with UTF-8.  But I was not arguing to foist SCSU on an 
unprepared world; I was suggesting that the world should prepare.  ☺

-Doug Ewell
 Fullerton, California
