David Starner wrote:

> On Wed, Oct 31, 2001 at 05:04:44PM -0800, Kenneth Whistler wrote:
> > And before going on, I'm not clear exactly what you are
> > trying to do. SCSU is defined on UTF-16 text.
>
> Why do you say that? I can't find the phrase "UTF-16" in UTS-6.

UTS #6 is a very early Unicode Technical Report. It was drafted, and
essentially completed, before UTF-8 was formally incorporated into the
Unicode Standard (in Unicode 3.0) and well before UTF-32 was defined
and formally incorporated into the Unicode Standard (in Unicode 3.1).
When it was written, Unicode *was* UTF-16, and nobody went out of their
way to make the distinction in terms all the time. This is true of all
Unicode documents from the Unicode 2.0 era.

> It's
> says that it's "a compression scheme for Unicode" and that "[SCSU] is
> mainly intended for use with short to medium length Unicode strings.".
> I noticed that the sample strings are in UTF-16, and count surrogate
> pairs as two characters (I think; for 9.4, I count 17 characters
> counting pairs as 1 and 19 as two, whereas the text claims 20), but I
> that's merely informative anyway.
>
> All the SCSU pieces I've written work directly from UTF-32. I'll admit
> I haven't done much checking with other encoders/decoders, but my
> decoder can handle all the sample strings correctly, as well as every
> thing my encoders put out.

I have no quarrel with the claim that the SCSU scheme could be
implemented directly on UTF-32 data. But as Unicode Technical
Standard #6 is currently written, that is not how to do it
conformantly.

It seems to me that a rewrite of SCSU would be in order to explicitly
allow and define UTF-32 implementations as well as UTF-16
implementations of SCSU.

>
> > I don't understand this analysis. The worst case for SCSU is always
> > UTF-16 length + 1 byte. This because if any garden path down the
> > heuristics leads to further expansions, you can always represent the
> > text as:
> >
> > SCU + (the rest of the text in Unicode)
>
> Section 5.2.1: "Each reserved tag value collides with 256 Unicode
> characters." If you do that and have private use values in your UTF-16
> string, decoding the SCSU will produce a different text.

My mistake.
I went back to my own implementation to remind myself of the problem
involved with the private use characters and the need for tag quoting.
You are correct that if you pick certain aberrant combinations of PUA
characters that themselves cannot compress, you end up with
3/2 * UTF-16 length as the worst case.

--Ken
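
For anyone who wants to check the arithmetic, here is a minimal sketch of
the "SCU + (the rest of the text in Unicode)" fallback with tag quoting
added. It is not taken from either implementation discussed above. The
function names are hypothetical, and the tag values (SCU = 0x0F,
UQU = 0xF0, Unicode-mode tags occupying high bytes 0xE0 through 0xF2) are
my reading of UTS #6 and should be verified against the specification
before being relied on.

```python
# Sketch of the Unicode-mode fallback: emit SCU once, then every UTF-16
# code unit as two big-endian bytes, quoting any unit whose high byte
# would otherwise be read as a Unicode-mode tag.
# Assumed tag values (check against UTS #6):
SCU = 0x0F                          # single-byte mode: switch to Unicode mode
UQU = 0xF0                          # Unicode mode: quote the next code unit
TAG_HIGH_BYTES = range(0xE0, 0xF3)  # high bytes that collide with tags


def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]


def encode_unicode_mode_fallback(text):
    """Encode text as SCU followed by (possibly quoted) UTF-16 code units."""
    out = bytearray([SCU])
    for unit in utf16_units(text):
        if (unit >> 8) in TAG_HIGH_BYTES:
            # Code units U+E000..U+F2FF need a UQU prefix: 3 bytes, not 2.
            out.append(UQU)
        out += unit.to_bytes(2, "big")
    return bytes(out)


if __name__ == "__main__":
    for sample in ("AB", "\ue000\ue001"):
        utf16_len = 2 * len(utf16_units(sample))
        print(repr(sample), utf16_len, len(encode_unicode_mode_fallback(sample)))
```

Under those assumptions, ordinary text costs UTF-16 length + 1 byte in
this fallback, while a run of code units in U+E000..U+F2FF costs 3 bytes
per code unit plus the single SCU byte, which is the 3/2 * UTF-16 length
case.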