Re: Worst case scenarios on SCSU
In a message dated 2001-10-31 15:54:34 Pacific Standard Time, [EMAIL PROTECTED] writes:

> Has anyone done worst case scenarios on SCSU, with respect to other
> methods of encoding Unicode characters?

In addition to theoretical worst-case scenarios, it might also be worthwhile to consider the practical limitations of certain encoders.

SCSU does not require encoders to be able to utilize the entire syntax of SCSU, so in the extreme case, a maximally stupid SCSU compressor could simply quote every character as Unicode:

    SQU hi-byte lo-byte SQU hi-byte lo-byte ...

This would result in a uniform 50% expansion over UTF-16, which is pretty bad.

On a more realistic level, even good SCSU encoders are not required by the specification to be infinitely intelligent and clever in their encoding. For example, I think my encoder is pretty decent, but it encodes the Japanese example in UTS #6 in 180 bytes rather than the 178 bytes illustrated in the report. This is because the Japanese data contains a couple of sequences of the form kanji-kana-kanji, where the kanji are not compressible and the kana are. If there is only one kana between the two kanji, as in this case, it is more efficient to just stay in Unicode mode for the kana rather than switching modes. My encoder isn't currently bright enough to figure this out. In the worst case, then, a long enough sequence of kana-kanji-kana-kanji... would take 5 bytes for every 2 BMP characters.

-Doug Ewell
 Fullerton, California
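[The "maximally stupid" compressor described above is easy to sketch. The following is an illustrative fragment, not a conformant encoder; the tag value 0x0E for SQU in single-byte mode comes from UTS #6, and the function name is invented:]

```python
SQU = 0x0E  # quote-Unicode tag in SCSU single-byte mode (per UTS #6)

def worst_case_scsu(text: str) -> bytes:
    """Quote every UTF-16 code unit as SQU hi-byte lo-byte."""
    units = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(units), 2):
        out += bytes([SQU, units[i], units[i + 1]])
    return bytes(out)

# 3 bytes per code unit instead of 2: a uniform 50% expansion over UTF-16
encoded = worst_case_scsu("hi")
assert len(encoded) == len("hi".encode("utf-16-be")) * 3 // 2
```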
Re: Worst case scenarios on SCSU
David Starner wrote:

> Has anyone done worst case scenarios on SCSU, with respect to other
> methods of encoding Unicode characters?

As current Czar of Names Rectification, I must start protesting here. SCSU is a means of *compressing* Unicode text. It is not "[an]other method of encoding Unicode characters." UTF-32, UTF-16, and UTF-8 are Unicode character encoding forms (CEFs). One can, of course, calculate the worst case length of text in each of those encoding forms, which turns out to be 4 bytes per character (code point) in each case. One can also compare various mixtures of text in any of those encoding forms with various compression schemes for the same text, including SCSU. But it is important not to compare apples and oranges here.

> The numbers I've got are:

And before going on, I'm not clear exactly what you are trying to do. SCSU is defined on UTF-16 text. It would, of course, be possible to create SCSU-like windowing compression schemes that would work on UTF-32 or UTF-8 text, but those are not part of UTS #6 as it is currently written. At the moment, if you want to compare SCSU-compressed text against the UTF-32 form, you would have to convert the UTF-32 text to UTF-16, and then compress it using SCSU. You don't apply SCSU directly to UTF-32 data.

> UTF-32: Since all characters (including any necessary state changes)
> can be encoded in four characters, and four characters would be
                         ^bytes          ^bytes
> necessary for a supplementary character outside any current window,
> the worst case scenario (for short strings) is an optimal SCSU length
> = the UTF-32 length. But in the long run, we must account for the
> windows. As an optimal sequence will probably look like
>
>     SQX foo bar baz SQX foo bar baz SCn byte SQX foo bar baz . . .
> SCSU length = UTF-32 length * % of astral characters not able to be
>               covered by 7 windows
>             + UTF-32 length * 2/4 * % of astral characters covered by
>               7 windows
>             + 2 bytes * 7 windows (to initially set up the windows)
>             = UTF-32 length * 8185/8192 + UTF-32 length * 7/16384 + 14
>             = UTF-32 length * 16377/16384 + 14
>
> (actually, min of this and UTF-32 length.)

This analysis might apply to a UTF-32 adaptation of SCSU, but that is a different animal than SCSU as it stands.

> UTF-16: This time, our worst case scenario is certain private use
> characters. Since certain private use characters take up 3 bytes (when
> encoded window-less) instead of two in UTF-16, preliminary guess is
> 3/2 the size of UTF-16.
      ^compressed
> It's susceptible to the same problem as above, only worse. Encoding
> all characters as either SDn window byte, SQU high low, or SCn byte,
> and using the reasoning above gets us
>
>     = UTF-16 length * 3/2 * 61/62 + UTF-16 length * 1/62 + 16
>
> (This may be somewhat weak, as increasing the ratio of private use
> characters makes windows more useful, and decreasing it makes Unicode
> mode more useful.)

I don't understand this analysis. The worst case for SCSU is always UTF-16 length + 1 byte. This is because if any garden path down the heuristics leads to further expansions, you can always represent the text as:

    SCU + (the rest of the text in Unicode)

> UTF-8: Worst case scenario is a series of NULs (or similar
> characters). Since this gives us a string with twice the length of the
> corresponding UTF-8 string, it can't be windowized, and there's no
> other characters that have much if any expansion, I'd say the worst
> case scenario is 2 * the UTF-8 length.

Here, you are saying that if I have a UTF-8 string 0x01 0x01 0x01 0x01... I'd have to represent it in SCSU as 0x0F 0x00 0x01 0x00 0x01 0x00 0x01...? (Actually NULs themselves would not be a problem, since they are passed as single bytes 0x00.)

--Ken
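[David's window arithmetic above does check out, whatever one thinks of applying it to SCSU proper. Assuming each of the 7 extended dynamic windows covers 128 of the 1,048,576 supplementary code points, a quick sanity check of the fractions:]

```python
from fractions import Fraction

supplementary = 16 * 65536                  # code points above U+FFFF
covered = Fraction(7 * 128, supplementary)  # 7 windows x 128 chars each
assert covered == Fraction(7, 8192)

# uncovered chars stay at 4 bytes (ratio 1); covered chars drop to 2 (ratio 1/2)
ratio = (1 - covered) * 1 + covered * Fraction(1, 2)
assert ratio == Fraction(16377, 16384)      # matches the figure in the thread
```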
Re: Worst case scenarios on SCSU
On Wed, Oct 31, 2001 at 05:04:44PM -0800, Kenneth Whistler wrote:

> And before going on, I'm not clear exactly what you are trying to do.
> SCSU is defined on UTF-16 text.

Why do you say that? I can't find the phrase "UTF-16" in UTS #6. It says that it's "a compression scheme for Unicode" and that "[SCSU] is mainly intended for use with short to medium length Unicode strings." I noticed that the sample strings are in UTF-16, and count surrogate pairs as two characters (I think; for 9.4, I count 17 characters counting pairs as one and 19 counting them as two, whereas the text claims 20), but that's merely informative anyway. All the SCSU pieces I've written work directly from UTF-32. I'll admit I haven't done much checking with other encoders/decoders, but my decoder can handle all the sample strings correctly, as well as everything my encoders put out.

> > UTF-32: Since all characters (including any necessary state changes)
> > can be encoded in four characters, and four characters would be
>                          ^bytes          ^bytes

Yes, sorry.

> I don't understand this analysis. The worst case for SCSU is always
> UTF-16 length + 1 byte. This is because if any garden path down the
> heuristics leads to further expansions, you can always represent the
> text as:
>
>     SCU + (the rest of the text in Unicode)

Section 5.2.1: "Each reserved tag value collides with 256 Unicode characters." If you do that and have private use values in your UTF-16 string, decoding the SCSU will produce a different text.

> Here, you are saying that if I have a UTF-8 string 0x01 0x01 0x01
> 0x01... I'd have to represent it in SCSU as 0x0F 0x00 0x01 0x00 0x01
> 0x00 0x01...? (Actually NULs themselves would not be a problem, since
> they are passed as single bytes 0x00.)

Right. I was thinking of SQ0 0x01 SQ0 0x01 . . . but it's the same idea.

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
"I saw a daemon stare into my face, and an angel touch my breast; each
one softly calls my name . . . the daemon scares me less."
- "Disciple", Stuart Davis
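[The SQ0 quoting mentioned above can be sketched as follows. This is a hypothetical fragment, not a full encoder: in single-byte mode only NUL, HT, LF, and CR among the C0 controls pass through as single bytes, so a character like U+0001 must be quoted via static window 0 (SQ0, tag 0x01, window offset U+0000), doubling its size relative to UTF-8:]

```python
SQ0 = 0x01  # quote-from-static-window-0 tag (window offset U+0000)
PASS_THROUGH = {0x00, 0x09, 0x0A, 0x0D}  # C0 bytes that are not SCSU tags

def encode_c0_run(text: str) -> bytes:
    """Encode a run of quotable C0 controls: 2 bytes each vs 1 in UTF-8."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        assert 0 <= cp < 0x20 and cp not in PASS_THROUGH
        out += bytes([SQ0, cp])
    return bytes(out)

# "\x01\x01" is 2 bytes in UTF-8 but 4 bytes here: the 2x worst case
assert encode_c0_run("\x01\x01") == b"\x01\x01\x01\x01"
```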
Re: Worst case scenarios on SCSU
David Starner wrote:

> On Wed, Oct 31, 2001 at 05:04:44PM -0800, Kenneth Whistler wrote:
> > And before going on, I'm not clear exactly what you are trying to
> > do. SCSU is defined on UTF-16 text.
>
> Why do you say that? I can't find the phrase "UTF-16" in UTS #6.

UTS #6 is a very early Unicode Technical Report. It was drafted, and essentially completed, before UTF-8 was formally incorporated into the Unicode Standard (in Unicode 3.0) and well before UTF-32 was defined and formally incorporated into the Unicode Standard (in Unicode 3.1). When it was written, Unicode *was* UTF-16, and nobody went out of their way to make the distinction in terms all the time. This is true of all Unicode documents from the Unicode 2.0 era.

> It says that it's "a compression scheme for Unicode" and that "[SCSU]
> is mainly intended for use with short to medium length Unicode
> strings." I noticed that the sample strings are in UTF-16, and count
> surrogate pairs as two characters (I think; for 9.4, I count 17
> characters counting pairs as one and 19 counting them as two, whereas
> the text claims 20), but that's merely informative anyway. All the
> SCSU pieces I've written work directly from UTF-32. I'll admit I
> haven't done much checking with other encoders/decoders, but my
> decoder can handle all the sample strings correctly, as well as
> everything my encoders put out.

I have no quarrel with the claim that the SCSU scheme could be implemented directly on UTF-32 data. But as Unicode Technical Standard #6 is currently written, that is not how to do it conformantly. It seems to me that a rewrite of SCSU would be in order to explicitly allow and define UTF-32 implementations as well as UTF-16 implementations of SCSU.

> > I don't understand this analysis. The worst case for SCSU is always
> > UTF-16 length + 1 byte.
> > This is because if any garden path down the heuristics leads to
> > further expansions, you can always represent the text as:
> >
> >     SCU + (the rest of the text in Unicode)
>
> Section 5.2.1: "Each reserved tag value collides with 256 Unicode
> characters." If you do that and have private use values in your UTF-16
> string, decoding the SCSU will produce a different text.

My mistake. I went back to my own implementation to remind myself of the problem involved with the private use characters and the need for tag quoting. You are correct that if you pick certain aberrant combinations of PUA characters that themselves cannot compress, you end up with 3/2 * UTF-16 length as the worst case.

--Ken
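[The tag collision that produces this 3/2 worst case can be illustrated with a small sketch. It assumes, per UTS #6, that in Unicode mode the byte values 0xE0-0xF2 are tags, so a BMP character whose high byte falls in that range (all private use) must be quoted with UQU (0xF0), taking 3 bytes instead of 2; the function name is invented:]

```python
UQU = 0xF0  # quote tag in SCSU Unicode mode

def unicode_mode_emit(cp: int) -> bytes:
    """Emit one BMP code point in Unicode mode, quoting tag collisions."""
    hi, lo = cp >> 8, cp & 0xFF
    if 0xE0 <= hi <= 0xF2:           # high byte collides with a tag value
        return bytes([UQU, hi, lo])  # 3 bytes instead of 2
    return bytes([hi, lo])

assert unicode_mode_emit(0xE500) == b"\xf0\xe5\x00"  # PUA: 3/2 expansion
assert unicode_mode_emit(0x0041) == b"\x00\x41"      # 'A': normal 2 bytes
```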
Re: Worst case scenarios on SCSU
It must be a full moon on Halloween, because here I am in the extremely unfamiliar position of disagreeing quite strongly with Ken Whistler.

In a message dated 2001-10-31 17:16:25 Pacific Standard Time, [EMAIL PROTECTED] writes:

> As current Czar of Names Rectification, I must start protesting here.
> SCSU is a means of *compressing* Unicode text. It is not "[an]other
> method of encoding Unicode characters."

I was about to reply, "Of course it is," before I realized that Ken was interpreting the word "encoding" in the strictest sense, invoking the distinction between character encoding forms (CEFs) and transfer encoding syntaxes (TESs). In some cases this is a worthwhile distinction, but I don't think it is relevant in the case of David's query, or, for that matter, in many other cases where users may think of Unicode text being represented as UTF-32, UTF-16, UTF-8, SCSU, ASCII with UCN sequences, or even (God forbid) CESU-8. SCSU is indeed another method of representing Unicode characters, if not necessarily "encoding" them in the strict sense of the word.

> And before going on, I'm not clear exactly what you are trying to do.
> SCSU is defined on UTF-16 text. It would, of course, be possible to
> create SCSU-like windowing compression schemes that would work on
> UTF-32 or UTF-8 text, but those are not part of UTS #6 as it is
> currently written.

Like David, I don't see how SCSU is defined on, or limited to, UTF-16 text, except in the sense that literal or quoted Unicode-mode SCSU text is UTF-16. SCSU is defined on Unicode scalar values, which are not tied to a particular CEF. You can define a window in what SCSU calls the "expansion space" using the SDX or UDX tag and, in the best case, store N characters of Gothic or Deseret text in N + 3 bytes. None of this has anything to do with surrogates or 16-bitness.
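[The N + 3 figure can be sketched for a run of characters drawn from a single 128-character supplementary block. This is an illustrative fragment, not a conformant encoder: it assumes the SDX operand layout of UTS #6 (3 bits of window index, 13 bits selecting a 128-character-aligned offset above U+10000) and hard-codes dynamic window 0:]

```python
SDX = 0x0B  # define-extended-window tag in single-byte mode

def encode_one_block_run(text: str) -> bytes:
    """N supplementary chars from one 128-char block -> N + 3 bytes."""
    cps = [ord(c) for c in text]
    base = cps[0] & ~0x7F                        # 128-aligned window offset
    assert all(base <= cp < base + 0x80 for cp in cps)
    position = (base - 0x10000) >> 7             # 13-bit window position
    window = 0                                   # use dynamic window 0
    out = bytearray([SDX, (window << 5) | (position >> 8), position & 0xFF])
    out += bytes(0x80 + (cp - base) for cp in cps)  # one byte per character
    return bytes(out)

# Two Gothic letters (U+10330, U+10331) -> 2 + 3 = 5 bytes
assert len(encode_one_block_run("\U00010330\U00010331")) == 5
```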
In a message dated 2001-10-31 17:59:33 Pacific Standard Time, [EMAIL PROTECTED] writes:

> I have no quarrel with the claim that the SCSU scheme could be
> implemented directly on UTF-32 data. But as Unicode Technical Standard
> #6 is currently written, that is not how to do it conformantly.

I have looked throughout UTS #6 and cannot find anything, explicit or implicit, to the effect that SCSU could not be conformantly implemented against UTF-32 data. Sections 6.1.3 and 8.1 refer to how surrogate pairs may be "encoded" (*) in SCSU, but if you substitute the phrase "non-BMP characters" the meaning is identical.

(*) The word "encoded" was taken directly from UTS #6, section 8.1.

> At the moment, if you want to compare SCSU-compressed text against the
> UTF-32 form, you would have to convert the UTF-32 text to UTF-16, and
> then compress it using SCSU. You don't apply SCSU directly to UTF-32
> data.

Why not? The fact that UTS #6 was originally written before UTF-32 was formally defined has nothing to do with this. The same could be said for UTF-8, which (like SCSU) has a surrogate-free mechanism for representing non-BMP characters.

> It seems to me that a rewrite of SCSU would be in order to explicitly
> allow and define UTF-32 implementations as well as UTF-16
> implementations of SCSU.

I don't see anything that needs rewriting. What are you seeing?

-Doug Ewell
 Fullerton, California
Re: Worst case scenarios on SCSU
At 05:50 PM 10/31/01 -0800, Kenneth Whistler wrote:

> I have no quarrel with the claim that the SCSU scheme could be
> implemented directly on UTF-32 data. But as Unicode Technical Standard
> #6 is currently written, that is not how to do it conformantly.

Actually, no specific encoding form is required for the uncompressed data. SCSU has always been a transformation from code point sequences to byte sequences. As long as the same byte sequence represents the same code point sequence, the implementation is conformant. (The encoder and decoder should probably state very clearly what encoding form they consume, resp. emit.)

> It seems to me that a rewrite of SCSU would be in order to explicitly
> allow and define UTF-32 implementations as well as UTF-16
> implementations of SCSU.

What is needed is a rewrite of SCSU that makes explicit that in the SCSU *compressed* data stream, Unicode mode is always UTF-16BE (instead of "two byte unicode in the usual way," as the current text reads ;-)

I have completed such a rewrite, with modest updates of the terminology, so as to not actually require Unicode 3.0 or 3.1 as base document. Since formally SCSU uses Unicode 2.0.0 as its base version, I have felt it inappropriate to go overboard in making changes. For that reason, I have introduced the term "supplementary code space" as a definition in TR6 itself. This allows me to eliminate references to "expansion space," which readers coming from 3.x can no longer follow, without requiring formal reference to 3.1 just for different words for the same thing.

Another goal was to limit the places in which text was changed, since no *technical* change of the specification is intended, and wholesale changes would have obscured this fact. I have added a short section on worst-case behavior as well.

The resulting draft is posted at http://www.unicode.org/~asmus/tr6-3.3d1.html for input.

A./