Re: Worst case scenarios on SCSU

2001-11-01 Thread DougEwell2

In a message dated 2001-10-31 15:54:34 Pacific Standard Time, 
[EMAIL PROTECTED] writes:

  Has any one done worst case scenarios on SCSU, with respect to other
  methods of encoding Unicode characters?

In addition to theoretical worst-case scenarios, it might also be worthwhile 
to consider the practical limitations of certain encoders.  SCSU does not 
require encoders to be able to utilize the entire syntax of SCSU, so in the 
extreme case, a maximally stupid SCSU compressor could simply quote every 
character as Unicode:

SQU hi-byte lo-byte SQU hi-byte lo-byte ...

This would result in a uniform 50% expansion over UTF-16, which is pretty bad.
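For concreteness, such a worst-possible compressor can be sketched in a few 
lines (a hypothetical illustration, not anyone's real encoder; only the SQU 
tag value 0x0E is taken from UTS #6):

```python
# Maximally naive SCSU "compressor": quote every UTF-16 code unit with
# SQU (0x0E), emitting 3 output bytes for every 2 input bytes.
SQU = 0x0E

def naive_scsu(text: str) -> bytes:
    utf16 = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(utf16), 2):
        out += bytes([SQU, utf16[i], utf16[i + 1]])  # SQU hi lo
    return bytes(out)

sample = "hello"
assert len(sample.encode("utf-16-be")) == 10
assert len(naive_scsu(sample)) == 15   # uniform 50% expansion
```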

On a more realistic level, even good SCSU encoders are not required by the 
specification to be infinitely intelligent and clever in their encoding.  For 
example, I think my encoder is pretty decent, but it encodes the Japanese 
example in UTF #6 in 180 bytes rather than the 178 bytes illustrated in the 
report.  This is because the Japanese data contains a couple of sequences of 
the form kanji-kana-kanji, where the kanji are not compressible and the kana 
are.  If there is only one kana between the two kanji, as in this case, 
it is more efficient to just stay in Unicode mode for the kana rather than 
switching modes.  My encoder isn't currently bright enough to figure this 
out.  In the worst case, then, a long-enough sequence of 
kana-kanji-kana-kanji... would take 5 bytes for every 2 BMP characters.
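The 5-bytes-per-2-characters figure follows directly from the per-switch tag 
costs (a back-of-the-envelope check, assuming one mode switch per character):

```python
# In an alternating kana/kanji run where every character forces a mode
# switch: a kana costs a UCn tag plus one window byte, a kanji costs an
# SCU tag plus one 16-bit Unicode-mode code unit.
kana_cost = 1 + 1    # UCn switch to single-byte mode + kana byte
kanji_cost = 1 + 2   # SCU switch to Unicode mode + hi/lo bytes
assert kana_cost + kanji_cost == 5   # 5 bytes per 2 BMP characters
```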

-Doug Ewell
 Fullerton, California




Re: Worst case scenarios on SCSU

2001-10-31 Thread Kenneth Whistler

David Starner wrote:

 Has any one done worst case scenarios on SCSU, with respect to other
 methods of encoding Unicode characters?

As current Czar of Names Rectification, I must start protesting
here. SCSU is a means of *compressing* Unicode text. It is
not "[an]other method of encoding Unicode characters."

UTF-32, UTF-16, and UTF-8 are Unicode character encoding forms
(CEF's). One can, of course, calculate the worst case length
of text in each of those encoding forms, which turns out to
be 4 bytes per character (code point) in each case. One can
also compare various mixtures of text in any of those encoding
forms with various compression schemes for the same text,
including SCSU. But it is important not to compare apples and
oranges here.

 
 The numbers I've got are:

And before going on, I'm not clear exactly what you are
trying to do. SCSU is defined on UTF-16 text. It would, of
course, be possible to create SCSU-like windowing compression
schemes that would work on UTF-32 or UTF-8 text, but those are
not part of UTS #6 as it is currently written.

At the moment, if you want to compare SCSU-compressed text
against the UTF-32 form, you would have to convert the UTF-32
text to UTF-16, and then compress it using SCSU. You don't
apply SCSU directly to UTF-32 data.

 
 UTF-32: Since all characters (including any necessary state changes)
 can be encoded in four characters, and four characters would be
                        ^bytes               ^bytes
 necessary for a supplementary character outside any current window, the
 worst case scenario (for short strings) is an optimal SCSU length = the
 UTF-32 length. But in the long run, we must account for the windows. As
 an optimal sequence will probably look like SQX foo bar baz SQX foo bar
 baz SCn byte SQX foo bar baz . . . SCSU length = UTF-32 length * % of 
 astral characters not able to be covered by 7 windows + UTF-32 length
 * 2/4 * % of astral characters covered by 7 windows + 2 bytes * 7 windows
 (to initially set up the windows)
 = UTF-32 length * 8185/8192 + UTF-32 length * 7/16384 + 14 
 = UTF-32 length * 16377/16384 + 14 
 (actually, min of this and UTF-32 length.)

This analysis might apply to a UTF-32 adaptation of SCSU, but that
is a different animal than SCSU as it stands.

 
 UTF-16: This time, our worst case scenario is certain private use
 characters. Since certain private use characters take up 3 bytes (when
 encoded window-less) instead of two in UTF-16, preliminary guess is 3/2
 ^compressed
 the size of UTF-16. It's susceptible to the same problem as above, only
 worse. Encoding all characters as either SDn window byte, SQU high
 low, or SCn byte, and using the reasoning above gets us
 = UTF-16 length * 3/2 * 61/62 + UTF-16 length * 1/62 + 16
 (This may be somewhat weak, as increasing the ratio of private use
 characters makes windows more useful, and decreasing it makes Unicode
 mode more useful.)

I don't understand this analysis. The worst case for SCSU is always
UTF-16 length + 1 byte. This is because if any garden path down the
heuristics leads to further expansions, you can always represent the
text as:

   SCU + (the rest of the text in Unicode)
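That fallback is trivial to sketch (an illustration of the claim as stated; 
David's follow-up below raises the reserved-tag caveat for private-use code 
units):

```python
# Worst-case fallback: one SCU tag (0x0F) switches to Unicode mode,
# then the text follows as raw UTF-16BE code units.
SCU = 0x0F

def scsu_unicode_fallback(text: str) -> bytes:
    return bytes([SCU]) + text.encode("utf-16-be")

s = "any garden-path text"
assert len(scsu_unicode_fallback(s)) == len(s.encode("utf-16-be")) + 1
```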

 
 UTF-8: Worst case scenario is a series of NULs (or similar characters).
 Since this gives us a string with twice the length of the corresponding
 UTF-8 string, it can't be windowized, and there are no other characters
 that have much if any expansion, I'd say the worst case scenario is 2 *
 the UTF-8 length.

Here, you are saying that if I have a UTF-8 string 0x01 0x01 0x01 0x01...
I'd have to represent it in SCSU as 0x0F 0x00 0x01 0x00 0x01 0x00 0x01...?
(Actually NULs themselves would not be a problem, since they are passed
as single bytes 0x00.)

--Ken




Re: Worst case scenarios on SCSU

2001-10-31 Thread David Starner

On Wed, Oct 31, 2001 at 05:04:44PM -0800, Kenneth Whistler wrote:
 And before going on, I'm not clear exactly what you are
 trying to do. SCSU is defined on UTF-16 text. 

Why do you say that? I can't find the phrase "UTF-16" in UTS #6. It
says that it's "a compression scheme for Unicode" and that "[SCSU] is
mainly intended for use with short to medium length Unicode strings."
I noticed that the sample strings are in UTF-16, and count surrogate
pairs as two characters (I think; for 9.4, I count 17 characters
counting pairs as one and 19 counting them as two, whereas the text
claims 20), but I think that's merely informative anyway.

All the SCSU pieces I've written work directly from UTF-32. I'll admit
I haven't done much checking with other encoders/decoders, but my
decoder can handle all the sample strings correctly, as well as
everything my encoders put out.

  UTF-32: Since all characters (including any necessary state changes)
  can be encoded in four characters, and four characters would be
                         ^bytes               ^bytes

Yes, sorry.

 I don't understand this analysis. The worst case for SCSU is always
 UTF-16 length + 1 byte. This is because if any garden path down the
 heuristics leads to further expansions, you can always represent the
 text as:
 
SCU + (the rest of the text in Unicode)

Section 5.2.1: "Each reserved tag value collides with 256 Unicode
characters." If you do that and have private use values in your UTF-16
string, decoding the SCSU will produce a different text.
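The collision can be made concrete (a sketch under my reading of section 
5.2.1: in Unicode mode, lead bytes 0xE0-0xF2 are tag values, so a code unit 
whose high byte falls in that range must be quoted with UQU, 0xF0):

```python
UQU = 0xF0
RESERVED = range(0xE0, 0xF3)  # Unicode-mode tag bytes: UCn, UDn, UQU, UDX

def unicode_mode_bytes(text: str) -> bytes:
    # Emit text as Unicode-mode code units, quoting any unit whose
    # high byte collides with a tag value.
    units = text.encode("utf-16-be")
    out = bytearray()
    for i in range(0, len(units), 2):
        hi, lo = units[i], units[i + 1]
        if hi in RESERVED:
            out.append(UQU)          # quote the colliding code unit
        out += bytes([hi, lo])
    return bytes(out)

pua = "\ue012" * 4   # private-use code units with high byte 0xE0
assert len(unicode_mode_bytes(pua)) == 3 * len(pua.encode("utf-16-be")) // 2
```

This is where the 3/2 * UTF-16 worst case comes from: three bytes per 
colliding code unit instead of two.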
 
 Here, you are saying that if I have a UTF-8 string 0x01 0x01 0x01 0x01...
 I'd have to represent it in SCSU as 0x0F 0x00 0x01 0x00 0x01 0x00 0x01...?
 (Actually NULs themselves would not be a problem, since they are passed
 as single bytes 0x00.)

Right. I was thinking of SQ0 0x01 SQ0 0x01 . . . but it's the same idea.
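The doubling is easy to see either way (a sketch of the SQ0 variant; SQ0 is 
tag 0x01, quoting one byte through static window 0 at offset 0x0000):

```python
SQ0 = 0x01   # "quote from window 0"; the very tag byte U+0001 collides with

def encode_u0001_run(n: int) -> bytes:
    # Each U+0001 must be quoted as SQ0 0x01, since 0x01 itself is a
    # tag in single-byte mode.
    return bytes([SQ0, 0x01]) * n

run = "\u0001" * 8
assert len(run.encode("utf-8")) == 8
assert len(encode_u0001_run(8)) == 16   # 2 * the UTF-8 length
```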

-- 
David Starner - [EMAIL PROTECTED]
Pointless website: http://dvdeug.dhis.org
I saw a daemon stare into my face, and an angel touch my breast; each 
one softly calls my name . . . the daemon scares me less.
- Disciple, Stuart Davis




Re: Worst case scenarios on SCSU

2001-10-31 Thread Kenneth Whistler

David Starner wrote:

 On Wed, Oct 31, 2001 at 05:04:44PM -0800, Kenneth Whistler wrote:
  And before going on, I'm not clear exactly what you are
  trying to do. SCSU is defined on UTF-16 text. 
 
 Why do you say that? I can't find the phrase "UTF-16" in UTS #6. 

UTS #6 is a very early Unicode Technical Report. It was drafted,
and essentially completed, before UTF-8 was formally incorporated
into the Unicode Standard (in Unicode 3.0) and well before
UTF-32 was defined and formally incorporated into the Unicode
Standard (in Unicode 3.1). When it was written, Unicode *was*
UTF-16, and nobody went out of their way to make the distinction
in terms all the time. This is true of all Unicode documents from
the Unicode 2.0 era.

 It
 says that it's "a compression scheme for Unicode" and that "[SCSU] is
 mainly intended for use with short to medium length Unicode strings."
 I noticed that the sample strings are in UTF-16, and count surrogate
 pairs as two characters (I think; for 9.4, I count 17 characters
 counting pairs as one and 19 counting them as two, whereas the text
 claims 20), but I think that's merely informative anyway.
 
 All the SCSU pieces I've written work directly from UTF-32. I'll admit
 I haven't done much checking with other encoders/decoders, but my
 decoder can handle all the sample strings correctly, as well as
 everything my encoders put out.

I have no quarrel with the claim that the SCSU scheme could be
implemented directly on UTF-32 data. But as Unicode Technical Standard
#6 is currently written, that is not how to do it conformantly.

It seems to me that a rewrite of SCSU would be in order to explicitly
allow and define UTF-32 implementations as well as UTF-16 implementations
of SCSU.

 
  I don't understand this analysis. The worst case for SCSU is always
  UTF-16 length + 1 byte. This is because if any garden path down the
  heuristics leads to further expansions, you can always represent the
  text as:
  
 SCU + (the rest of the text in Unicode)
 
 Section 5.2.1: "Each reserved tag value collides with 256 Unicode
 characters." If you do that and have private use values in your UTF-16
 string, decoding the SCSU will produce a different text.

My mistake. I went back to my own implementation to remind myself of
the problem involved with the private use characters and the need
for tag quoting. You are correct that if you pick certain aberrant
combinations of PUA characters that themselves cannot compress, you
end up with 3/2 * UTF-16 length as the worst case.

--Ken




Re: Worst case scenarios on SCSU

2001-10-31 Thread DougEwell2

It must be a full moon on Halloween, because here I am in the extremely 
unfamiliar position of disagreeing quite strongly with Ken Whistler.

In a message dated 2001-10-31 17:16:25 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

  As current Czar of Names Rectification, I must start protesting
  here. SCSU is a means of *compressing* Unicode text. It is
  not [an]other method of encoding Unicode characters.

I was about to reply, "Of course it is," before I realized that Ken was 
interpreting the word "encoding" in the strictest sense, invoking the 
distinction between character encoding forms (CEFs) and transfer encoding 
syntaxes (TESs).  In some cases this is a worthwhile distinction, but I don't 
think it is relevant in the case of David's query, or, for that matter, in 
many other cases where users may think of Unicode text being represented as 
UTF-32, UTF-16, UTF-8, SCSU, ASCII with UCN sequences, or even (God forbid) 
CESU-8.

SCSU is indeed another method of representing Unicode characters, if not 
necessarily "encoding" them in the strict sense of the word.

  And before going on, I'm not clear exactly what you are
  trying to do. SCSU is defined on UTF-16 text. It would, of
  course, be possible to create SCSU-like windowing compression
  schemes that would work on UTF-32 or UTF-8 text, but those are
  not part of UTS #6 as it is currently written.

Like David, I don't see how SCSU is "defined on," or limited to, UTF-16 text, 
except in the sense that literal or quoted Unicode-mode SCSU text is 
UTF-16.  SCSU is defined on Unicode scalar values, which are not tied to a 
particular CEF.

You can define a window in what SCSU calls the "expansion space" using the 
SDX or UDX tag and, in the best case, store N characters of Gothic or Deseret 
text in N + 3 bytes.  None of this has anything to do with surrogates or 
16-bitness.
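The N + 3 best case can be sketched as follows (an illustration under my 
reading of the SDX operand layout: window index in the top 3 bits of the 
first operand byte, window offset = 0x10000 + low-13-bits * 0x80):

```python
SDX = 0x0B   # define a dynamic window in the expansion space

def deseret_scsu(text: str) -> bytes:
    base = 0x10400                       # start of the Deseret block
    operand = (base - 0x10000) // 0x80   # 13-bit half-block offset
    window = 0                           # dynamic window index (top 3 bits)
    out = bytearray([SDX, (window << 5) | (operand >> 8), operand & 0xFF])
    for ch in text:
        out.append(0x80 + (ord(ch) - base))  # one byte per character
    return bytes(out)

s = "\U00010400\U00010401\U00010402"     # three Deseret letters
assert len(deseret_scsu(s)) == len(s) + 3
```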

In a message dated 2001-10-31 17:59:33 Pacific Standard Time, [EMAIL PROTECTED] 
writes:

  I have no quarrel with the claim that the SCSU scheme could be
  implemented directly on UTF-32 data. But as Unicode Technical Standard
  #6 is currently written, that is not how to do it conformantly.

I have looked throughout UTS #6 and cannot find anything, explicit or 
implicit, to the effect that SCSU could not be conformantly implemented 
against UTF-32 data.  Sections 6.1.3 and 8.1 refer to how surrogate pairs 
may be "encoded" (*) in SCSU, but if you substitute the phrase "non-BMP 
characters" the meaning is identical.

(*) The word "encoded" was taken directly from UTS #6, section 8.1.

  At the moment, if you want to compare SCSU-compressed text
  against the UTF-32 form, you would have to convert the UTF-32
  text to UTF-16, and then compress it using SCSU. You don't
  apply SCSU directly to UTF-32 data.

Why not?  The fact that UTS #6 was originally written before UTF-32 was 
formally defined has nothing to do with this.  The same could be said for 
UTF-8, which (like SCSU) has a surrogate-free mechanism for representing 
non-BMP characters.

  It seems to me that a rewrite of SCSU would be in order to explicitly
  allow and define UTF-32 implementations as well as UTF-16 implementations
  of SCSU.

I don't see anything that needs rewriting.  What are you seeing?

-Doug Ewell
 Fullerton, California




Re: Worst case scenarios on SCSU

2001-10-31 Thread Asmus Freytag

At 05:50 PM 10/31/01 -0800, Kenneth Whistler wrote:
I have no quarrel with the claim that the SCSU scheme could be
implemented directly on UTF-32 data. But as Unicode Technical Standard
#6 is currently written, that is not how to do it conformantly.

Actually, no specific encoding form is required for the uncompressed data.
SCSU has always been a transformation from code point sequences to byte
sequences. As long as the same byte sequence represents the same code point
sequence, the implementation is conformant. (The encoder and decoder should
probably state very clearly what encoding form they consume, resp. emit).


It seems to me that a rewrite of SCSU would be in order to explicitly
allow and define UTF-32 implementations as well as UTF-16 implementations
of SCSU.

What is needed is a rewrite of SCSU that makes explicit that in the SCSU
*compressed* data stream unicode mode is always UTF-16BE (instead of
"two byte unicode in the usual way", as the current text reads ;-)

I have completed such a rewrite, with modest updates of the terminology,
so as to not actually require Unicode 3.0 or 3.1 as base document. Since
formally SCSU uses Unicode 2.0.0 as base version, I have felt it
inappropriate to go overboard in making changes.

For that reason, I have introduced the term "supplementary code space"
as a definition in TR6 itself. This allows me to eliminate references
to "expansion space", which readers coming from 3.x can no longer follow,
without requiring formal reference to 3.1 just for different words for the
same thing.

Another goal was to limit the places in which text was changed, since no
*technical* change of the specification is intended, and wholesale changes
would have obscured this fact.

I have added a short section on worst-case behavior as well.

The resulting draft is posted on http://www.unicode.org/~asmus/tr6-3.3d1.html
for input.

A./