RE: Compression through normalization

2003-12-06 Thread Philippe Verdy
Mark Davis writes: > > OK, then I suppose I should play devil's advocate and ask Peter's and > > Philippe's question again: If C10 only restricts the modifications to > > "canonically equivalent sequences," why should there be an additional > > restriction that further limits them to NFC or NFD?

Re: Compression through normalization

2003-12-06 Thread Mark Davis
- Original Message - From: "Doug Ewell" <[EMAIL PROTECTED]> To: "Unicode Mailing List" <[EMAIL PROTECTED]> Cc: "Kenneth Whistler" <[EMAIL PROTECTED]> Sent: Fri, 2003 Dec 05 23:38 Subject: Re: Compression through norm

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Peter Kirk wrote: >> Subprocesses within a closed system may be able to make certain >> assumptions for efficiency. Process B, for example, may know that >> its only source of input is Process A, which is guaranteed always to >> produce NFC. ... > > Does C9 actually allow this? Well, perhaps wit

Re: Compression through normalization

2003-12-06 Thread Peter Kirk
On 06/12/2003 09:49, Doug Ewell wrote: ... But as C10 does not mandate any normalized form (just canonical equivalence of the results), I don't think that it requires that a compressor should produce its result in either NFC or NFD form Right. I know that. But Mark and Ken said it should,

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Philippe Verdy wrote: > First C10 only restricts modifications just to preserve all the > semantics of the encoded text in any context. There are situations > where this restriction does not apply: when performing text > transformations (such as folding, or even substringing, which may or > may n

Re: Compression through normalization

2003-12-06 Thread Peter Kirk
On 06/12/2003 03:48, Philippe Verdy wrote: ... But as C10 does not mandate any normalized form (just canonical equivalence of the results), I don't think that it requires that a compressor should produce its result in either NFC or NFD form Instead I think that it's up to the next process to dete

RE: Compression through normalization

2003-12-06 Thread Philippe Verdy
Doug Ewell > OK, then I suppose I should play devil's advocate and ask Peter's and > Philippe's question again: If C10 only restricts the modifications to > "canonically equivalent sequences," why should there be an additional > restriction that further limits them to NFC or NFD? Or, put another

Re: Compression through normalization

2003-12-06 Thread Jungshik Shin
On Fri, 5 Dec 2003, Doug Ewell wrote: > Philippe Verdy wrote: > > > Still on the same subject, how do the old KSX standards for Han[g]ul > > compare with each other? If they are upward compatible, and specify that > > the conversion from an old text not using compound letters to the new > > In

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Kenneth Whistler wrote: > I don't think either of our recommendations here are specific > to compression issues. They're not, but compression is what I'm focusing on right now, and your recommendations do *apply* to compression. > Basically, if a process tinkers around with changing sequences >

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Mark Davis wrote: > Think you are missing a negative, see below. > >> Compression techniques may optionally replace certain sequences with >> canonically equivalent sequences to improve efficiency, but *only* if >> the output of the decompressed text is expected to be > is not required to be >> c

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Philippe Verdy wrote: > Still on the same subject, how do the old KSX standards for Han[g]ul > compare with each other? If they are upward compatible, and specify that > the conversion from an old text not using compound letters to the new > standard does not mandate their composition into compound ja

Re: Compression through normalization

2003-12-05 Thread Mark Davis
Well, in my dialect of English, 'ken' and 'can' are nearly indistinguishable, and there are many "can's" in Unicode; probably more than "mark's". I'm reminded of what a farmer is supposed to have once said about his produce: "We eat what we can, and what we can't, we can." > P.S. On the other hand,

Re: Compression through normalization

2003-12-05 Thread Peter Kirk
On 05/12/2003 14:01, Philippe Verdy wrote: ... It's just a shame that what was considered as equivalent in the Korean standards is considered as canonically distinct (and even compatibility distinct) in Unicode. This means that the same exact abstract Korean text can have two distinct representat

Re: Compression through normalization

2003-12-05 Thread Michael Everson
At 13:13 -0800 2003-12-05, Kenneth Whistler wrote: On the other hand, if you asked him nicely, Mark might find the more marked form, NFD, to his liking, especially since it is likely to contain more combining marks. Mark is definitely in favor of markedness. I, on the other hand, am definitely

RE: Compression through normalization

2003-12-05 Thread Philippe Verdy
Mark Davis writes: > Doug Ewell writes: > > OK. So it's Mark, not me, who is unilaterally extending C10. > > Where on earth do you get that? I did say that, in practice, NFC should be > produced, but that is simply a practical guideline, independent of C10. I also think that the NFC form is not r

Re: Compression through normalization

2003-12-05 Thread Kenneth Whistler
Doug asked: > Mark indicated that a compression-decompression cycle should not only > stick to canonical-equivalent sequences, which is what C10 requires, but > should convert text only to NFC (if at all). Ken mentioned > normalization "to forms NFC or NFD," but I'm not sure this was in the > sam

Re: Compression through normalization

2003-12-05 Thread Peter Kirk
On 05/12/2003 10:03, Mark Davis wrote: OK. So it's Mark, not me, who is unilaterally extending C10. Where on earth do you get that? I did say that, in practice, NFC should be produced, but that is simply a practical guideline, independent of C10. Mark Well, of course "unilaterally extendi

Re: Compression through normalization

2003-12-05 Thread Mark Davis
t;Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Fri, 2003 Dec 05 08:43 Subject: Re: Compression through normalization > Kenneth Whistler wrote: > > > Canonical equivalence is about not modifying the interpretation of the > > text. That

Re: Compression through normalization

2003-12-05 Thread Mark Davis
- Original Message - From: "Peter Kirk" <[EMAIL PROTECTED]> To: "Doug Ewell" <[EMAIL PROTECTED]> Cc: "Unicode Mailing List" <[EMAIL PROTECTED]> Sent: Fri, 2003 Dec 05 02:51 Subject: Re: Compression through normalization > On

Re: Compression through normalization

2003-12-05 Thread Doug Ewell
Kenneth Whistler wrote: > Canonical equivalence is about not modifying the interpretation of the > text. That is different from considerations about not changing the > text, period. > > If some process using text is sensitive to *any* change in the text > whatsover (CRC-checking or any form of di

Re: Compression through normalization

2003-12-05 Thread Peter Kirk
On 05/12/2003 00:34, Doug Ewell wrote: Peter Kirk wrote: Surely ignoring Composition Exclusions is not unilaterally extending C10. The excluded precomposed characters are still canonically equivalent to the decomposed (and normalised) forms. And so composing a text with them, for compression

Re: Compression through normalization

2003-12-05 Thread Doug Ewell
Peter Kirk wrote: > Surely ignoring Composition Exclusions is not unilaterally extending > C10. The excluded precomposed characters are still canonically > equivalent to the decomposed (and normalised) forms. And so composing > a text with them, for compression or any other purpose, still conform
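
A minimal Python sketch (illustrative, not from the thread; it assumes only the standard unicodedata module) of the point being made: a composition-excluded precomposed character is still canonically equivalent to its decomposition, even though NFC will never produce it.

    import unicodedata

    qa_precomposed = "\u0958"        # DEVANAGARI LETTER QA, a composition exclusion
    qa_decomposed = "\u0915\u093C"   # DEVANAGARI LETTER KA + SIGN NUKTA

    # The excluded character still has a canonical decomposition...
    assert unicodedata.normalize("NFD", qa_precomposed) == qa_decomposed
    # ...but NFC leaves the decomposed spelling alone, so the two spellings
    # are canonically equivalent even though only one of them is in NFC.
    assert unicodedata.normalize("NFC", qa_decomposed) == qa_decomposed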

RE: Compression through normalization

2003-12-04 Thread Philippe Verdy
> If some process using text is sensitive to the *interpretation* of > the text, i.e. it is concerned about the content and meaning of > the letters involved, then normalization, to forms NFC or NFD, > which only involve canonical equivalences, will *not* make a difference. > Or to be more subtle a

Re: Compression through normalization

2003-12-04 Thread Kenneth Whistler
Mark said: > The operations of compression followed by decompression can conformantly produce > any text that is canonically equivalent to the original without purporting to > modify the text. (How the internal compressed format is determined is completely > arbitrary - it could NFD, compress, dec

Re: Compression through normalization

2003-12-04 Thread Peter Kirk
On 04/12/2003 08:39, Doug Ewell wrote: ... (2) I am NOT interested in inventing a new normalization form, or any variants on existing forms. Any approach that involves compatibility equivalences, ignores the Composition Exclusions table, or creates equivalences that do not exist in the Unicode

Re: Compression through normalization

2003-12-04 Thread Mark Davis
ument that it does so. Mark __ http://www.macchiato.com - Original Message - From: "Doug Ewell" <[EMAIL PROTECTED]> To: "Unicode Mailing List" <[EMAIL PROTECTED]> Sent: Thu, 2003 Dec 04 0

Re: Compression through normalization

2003-12-04 Thread Doug Ewell
Just to clear up some possible misconceptions that I think may have developed: This thread started when Philippe Verdy mentioned the possibility of converting certain sequences of Unicode characters to a *canonically equivalent sequence* to improve compression. An example was converting Korean te

RE: Compression through normalization

2003-12-04 Thread Philippe Verdy
Kent Karlsson wrote: > Philippe Verdy wrote: > > If we count also the encoded modern LV and LVT johab syllables: > > > > ( ((Ls|Lm)+ (Vs|Vm)+) | > > ((Ls|Lm)* (LsVs|LsVm|LmVs|LmVm) (Vs|Vm)*) | > > ((Ls|Lm)* (LsVsTs|LsVmTs|LmVsTs|LmVmTs| > >

RE: Compression through normalization

2003-12-04 Thread Kent Karlsson
Philippe Verdy wrote: ... > letters each. Fortunately, the definition of Hangul syllable blocks need > not be changed, as it works well with Hangul syllables as L+, V+, T* > (where L, V, and T stand for single-letter jamos). In fact the Unicode encoding of modern
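
A rough sketch of that L+ V+ T* pattern as a regular expression over the conjoining jamo ranges (Python; the ranges U+1100-U+1112, U+1161-U+1175 and U+11A8-U+11C2 are the modern choseong, jungseong and jongseong, so this is illustrative only and ignores archaic jamos, fillers and precomposed syllables):

    import re

    L = "[\u1100-\u1112]"   # modern leading consonants (choseong)
    V = "[\u1161-\u1175]"   # modern vowels (jungseong)
    T = "[\u11A8-\u11C2]"   # modern trailing consonants (jongseong)

    # One Hangul syllable block, per the L+ V+ T* description above.
    syllable_block = re.compile(f"{L}+{V}+{T}*")

    han = "\u1112\u1161\u11AB"   # HIEUH + A + NIEUN, i.e. one syllable block
    print(bool(syllable_block.fullmatch(han)))   # True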

RE: Compression through normalization

2003-12-04 Thread Philippe Verdy
Kent Karlsson writes: > Philippe Verdy wrote: > > > I just have another question for Korean: many jamos are in fact > > composed from other jamos: this is clearly visible both in their name > > and in their composed glyph. What would be the linguistic impact of > > decomposing them (not canoni

RE: Compression through normalization

2003-12-04 Thread Kent Karlsson
Philippe Verdy wrote: > I just have another question for Korean: many jamos are in fact > composed from other jamos: this is clearly visible both in their name > and in their composed glyph. What would be the linguistic impact of > decomposing them (not canonically!)? Do Koreans really learn

Re: Compression through normalization

2003-12-03 Thread John Cowan
Jungshik Shin scripsit: > (e.g. 'enough' in English is logographic in a sense, isn't it?) No, it's just obsolete. Some parts of Hangeul spelling are obsolete too. "&", that's logographic. -- John Cowan[EMAIL PROTECTED] At times of peril or dubitation, h

RE: Compression through normalization

2003-12-03 Thread Philippe Verdy
Jungshik Shin writes: > On Wed, 3 Dec 2003, Philippe Verdy wrote: > > > I just have another question for Korean: many jamos are in fact composed > > from other jamos: this is clearly visible both in their name > and in their > > composed glyph. What would be the linguistic impact of > decomposin

RE: Compression through normalization

2003-12-03 Thread Philippe Verdy
> From: Jungshik Shin [mailto:[EMAIL PROTECTED] > Note that Korean syllables in Unicode are NOT "LVT?" as you > seem to think I did not say that... > BUT "L+V+T*" with '+', '*' and '?' having the usual RE meaning. I said this: ( ((L* V* VT T*) - (L* V+ T)) | X )* > Who said that? 11,172 precom

Re: Compression through normalization

2003-12-03 Thread Jungshik Shin
On Wed, 3 Dec 2003, Doug Ewell wrote: > Philippe Verdy wrote: > Speaking of which, I just noticed that the function in SC UniPad to > compose syllables from jamos does not handle this case (LV + T = LVT). > I'll have to report that to the UniPad team. Yudit, Mozilla and soon a whole bunch of
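
The LV + T = LVT case that UniPad reportedly misses falls out of the arithmetic Hangul composition in the standard; a small Python sketch (the SBase/TBase constants are the standard values from the Unicode Hangul algorithm, the helper name is mine):

    S_BASE, T_BASE, T_COUNT = 0xAC00, 0x11A7, 28

    def compose_lv_t(lv: str, t: str) -> str:
        """Combine a precomposed LV syllable with a trailing jamo into an LVT syllable."""
        s_index = ord(lv) - S_BASE
        assert s_index % T_COUNT == 0, "first argument must be an LV syllable"
        t_index = ord(t) - T_BASE
        return chr(S_BASE + s_index + t_index)

    # U+AC00 HANGUL SYLLABLE GA + U+11A8 JONGSEONG KIYEOK -> U+AC01 HANGUL SYLLABLE GAG
    assert compose_lv_t("\uAC00", "\u11A8") == "\uAC01"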

RE: Compression through normalization

2003-12-03 Thread Jungshik Shin
On Thu, 4 Dec 2003, Jungshik Shin wrote: > On Wed, 3 Dec 2003, Philippe Verdy wrote: > > That kind of composition/decomposition is necessary for linguistic > analysis of Korean. Search engines (e.g. google), rendering engines > and incremental searches also need that. See > > http://i18nl10n

RE: Compression through normalization

2003-12-03 Thread Jungshik Shin
On Wed, 3 Dec 2003, Philippe Verdy wrote: > I just have another question for Korean: many jamos are in fact composed > from other jamos: this is clearly visible both in their name and in their > composed glyph. What would be the linguistic impact of decomposing them (not > canonically!)? Do Korean

Re: decomposable Hangul jamos (was: Compression through normalization)

2003-12-03 Thread Philippe Verdy
Doug Ewell writes: > > I just have another question for Korean: many jamos are in fact > > composed from other jamos: this is clearly visible both in their name > > and in their composed glyph. What would be the linguistic impact of > > decomposing them (not canonically!)? Do Koreans really learn th

RE: Compression through normalization

2003-12-03 Thread Jungshik Shin
On Wed, 3 Dec 2003, Philippe Verdy wrote: > Jungshik Shin writes: > > > > I already answered about it: I had mixed the letters TLV instead of > > > > LVT. All the above was correct if you swap the letters. So what I did > > > > really was to compose only VT but not LV nor LVT: > > > > > > > > ( ((

Re: Compression through normalization

2003-12-03 Thread Doug Ewell
Philippe Verdy wrote: > I still think that we could try to use only LV syllables but not LVT > syllables to reduce the set of Hangul character used if this helps > the final compressor. Aha, LV syllables. Now we are talking about something that exists and can be used in the manner you describe.

RE: Compression through normalization

2003-12-03 Thread Philippe Verdy
Doug Ewell writes: > I just read C10 again and noticed that it says that character sequences > can be replaced by canonical-equivalent sequences -- NOT that they have > to end up in a particular normalization form. So your strategy of > converting to a form halfway between NFC and NFD seems accept

RE: Compression through normalization

2003-12-03 Thread Philippe Verdy
Jungshik Shin writes: > > > I already answered about it: I had mixed the letters TLV instead of > > > LVT. All the above was correct if you swap the letters. So what I did > > > really was to compose only VT but not LV nor LVT: > > > > > > ( ((L* V* VT T*) - (L* V+ T)) | X )* > > > > > > I did it b

Re: Compression through normalization

2003-12-01 Thread Peter Kirk
On 01/12/2003 04:25, Philippe Verdy wrote: ... And what about a compressor that would identify the source as being Unicode, and would convert it first to NFC, but including composed forms for the compositions normally excluded from NFC? This seems marginal but some languages would have better

RE: Compression through normalization

2003-12-01 Thread jon
Quoting Philippe Verdy <[EMAIL PROTECTED]>: > [EMAIL PROTECTED] wrote: > > Further, a Unicode-aware algorithm would expect a choseong character to > > be followed by a jungseong and a jongseong to follow a jungseong, and > > could essentially perform the same benefits to compression that > > nor

RE: Compression through normalization

2003-12-01 Thread Philippe Verdy
[EMAIL PROTECTED] wrote: > Further, a Unicode-aware algorithm would expect a choseong character to > be followed by a jungseong and a jongseong to follow a jungseong, and > could essentially perform the same benefits to compression that > normalising to NFC performs but without making an irrevers

Re: Compression through normalization

2003-12-01 Thread jon
Quoting Doug Ewell <[EMAIL PROTECTED]>: > Someone, I forgot who, questioned whether converting Unicode text to NFC > would actually improve its compressibility, and asked if any actual data > was available. I was pretty sure converting to NFC would help compression (at least some of the time), I

Re: Compression through normalization

2003-11-30 Thread Doug Ewell
Jungshik Shin wrote: > I finally downloaded the file and took a look at it. I was surprised > to find that the text is the entire content of the volume 1 of a > famous Korean novel (Arirang) by a _living_ Korean writer CHO Chongrae > (published in the early 1990's). This seems to be problematic b

Re: Compression through normalization

2003-11-30 Thread Jungshik Shin
On Sat, 29 Nov 2003, Doug Ewell wrote: > A longer and more realistic case can be seen in the sample Korean file > at: > > http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt I finally downloaded the file and took a look at it. I was surprised to find that the text is the entir

Re: Compression through normalization

2003-11-29 Thread Doug Ewell
Someone, I forget who, questioned whether converting Unicode text to NFC would actually improve its compressibility, and asked if any actual data was available. Certainly there is no guarantee that normalization would *always* result in a smaller file. A compressor that took advantage of normaliz
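
The kind of measurement being asked for is easy to reproduce; a rough Python sketch (zlib standing in for a general-purpose compressor, "arirang.txt" as a placeholder input file) that compares the deflated size of a UTF-8 text before and after NFC normalization:

    import unicodedata, zlib

    def deflated_size(text: str) -> int:
        return len(zlib.compress(text.encode("utf-8")))

    with open("arirang.txt", encoding="utf-8") as f:   # placeholder file name
        original = f.read()

    nfc = unicodedata.normalize("NFC", original)
    print("original:", deflated_size(original))
    print("NFC     :", deflated_size(nfc))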

RE: Compression through normalization

2003-11-27 Thread Philippe Verdy
Doug Ewell writes: > Peter Kirk wrote: > > > Yes, the compressor can make any canonically equivalent change, not > > just composing composition exclusions but reordering combining marks > > in different classes. The only flaw I see is that the compressor does > > not have to undo these changes on

Re: Compression through normalization

2003-11-27 Thread Doug Ewell
Peter Kirk wrote: > Yes, the compressor can make any canonically equivalent change, not > just composing composition exclusions but reordering combining marks > in different classes. The only flaw I see is that the compressor does > not have to undo these changes on decompression; at least no oth

RE: Compression through normalization

2003-11-26 Thread D. Starner
> Use Base64 - it is stable through all normalisation forms. The problem with Base64 (and, worse yet, PUA characters for bytes) is that it's inefficient. Base64 offers 6 bits per 8 (75%) on UTF-8, 6 bits per 16 (37%) on UTF-16. You can get 15 bits per 16 (93%) on UTF-16 and 15 bits per 24 (62%) on

RE: Compression through normalization

2003-11-26 Thread jon
> The whole point of such a tool would be to send binary data on a transport > that > only allowed Unicode text. In practice, you'd also have to remap C0 and C1 > characters; but even then 0x00-0x1F -> U+0250-026F and 0x80-0x9F to > U+0270-U+028F > wouldn't be too complex. Unless you've added a Uni
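
The remapping quoted above is simple enough to sketch (Python; this only illustrates the byte-to-character scheme as quoted - C0 bytes to U+0250.., C1 bytes to U+0270.., everything else passed through as Latin-1 - and deliberately ignores the normalization questions the thread is about):

    def bytes_to_text(data: bytes) -> str:
        out = []
        for b in data:
            if b < 0x20:                       # C0 controls -> U+0250..U+026F
                out.append(chr(0x0250 + b))
            elif 0x80 <= b <= 0x9F:            # C1 controls -> U+0270..U+028F
                out.append(chr(0x0270 + b - 0x80))
            else:                              # pass through as Latin-1
                out.append(chr(b))
        return "".join(out)

    def text_to_bytes(text: str) -> bytes:
        out = bytearray()
        for c in text:
            cp = ord(c)
            if 0x0250 <= cp <= 0x026F:
                out.append(cp - 0x0250)
            elif 0x0270 <= cp <= 0x028F:
                out.append(cp - 0x0270 + 0x80)
            else:
                out.append(cp)
        return bytes(out)

    assert text_to_bytes(bytes_to_text(bytes(range(256)))) == bytes(range(256))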

Re: Compression through normalization

2003-11-26 Thread Peter Kirk
On 26/11/2003 07:05, D. Starner wrote: ... The whole point of such a tool would be to send binary data on a transport that only allowed Unicode text. In practice, you'd also have to remap C0 and C1 characters; but even then 0x00-0x1F -> U+0250-026F and 0x80-0x9F to U+0270-U+028F wouldn't be too c

RE: Compression through normalization

2003-11-26 Thread D. Starner
> I see no reason why you accept some limitations for this > encapsulation, but not ALL the limitations. Because I can convert the data from binary to Unicode text in UTF-16 in a few lines of code if I don't worry about normalization. Suddenly the rules become much more complex if I have to worry

RE: Compression through normalization

2003-11-26 Thread Philippe Verdy
D. Starner writes: > > In the case of GIF versus JPG, which are usually regarded as "lossless" > > versus "lossy", please note that there /is/ no "original", in the sense > > of a stream of bytes. Why not? Because an image is not a stream of > > bytes. Period. > > GIF isn't a compression scheme

RE: Compression through normalization

2003-11-26 Thread Philippe Verdy
Peter Kirk [peterkirk at qaya dot org] writes: > On 25/11/2003 16:38, Doug Ewell wrote: > > >Philippe Verdy wrote: > > > >>So SCSU and BOCU-* formats are NOT general purpose compressors. As > >>they are defined only in terms of stream of Unicode code points, they > >>are assumed to follow the co

RE: Compression through normalization

2003-11-26 Thread D. Starner
> In the case of GIF versus JPG, which are usually regarded as "lossless" > versus "lossy", please note that there /is/ no "original", in the sense > of a stream of bytes. Why not? Because an image is not a stream of > bytes. Period. GIF isn't a compression scheme; it uses the LZW compression s

RE: Compression through normalization

2003-11-26 Thread jon
> In the case of GIF versus JPG, which are usually regarded as "lossless" > versus "lossy", please note that there /is/ no "original", in the sense > of a stream of bytes. Why not? Because an image is not a stream of > bytes. Period. What is being compressed here is a rectangular array of > pixe

Re: Compression through normalization

2003-11-26 Thread Peter Kirk
On 25/11/2003 16:38, Doug Ewell wrote: Philippe Verdy wrote: So SCSU and BOCU-* formats are NOT general purpose compressors. As they are defined only in terms of stream of Unicode code points, they are assumed to follow the conformance clauses of Unicode. As they recognize their input as Unic

RE: Compression through normalization

2003-11-26 Thread Arcane Jill
- Original Message - > From: Doug Ewell [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, November 25, 2003 7:09 PM > To: Unicode Mailing List; UnicoRe Mailing List > Subject: Re: Compression through normalization > > > Here's a summary of the responses so far: > > * Phi

Re: Compression through normalization

2003-11-25 Thread Doug Ewell
Philippe Verdy wrote: > So SCSU and BOCU-* formats are NOT general purpose compressors. As > they are defined only in terms of stream of Unicode code points, they > are assumed to follow the conformance clauses of Unicode. As they > recognize their input as Unicode text, they can recognize canoni

RE: Compression through normalization

2003-11-25 Thread Philippe Verdy
Doug Ewell writes: > Yes, you can take SCSU- or BOCU-1-encoded text and recompress it using a > GP compression scheme. Atkin and Stansifer's paper from last year is > all about that, and I spend a few pages on it in my paper as well. You > can also re-Zip a Zip file, though, so I don't know what

RE: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Philippe Verdy
Rick McGowan writes: > John Cowan suggested... > > We will never come close to exceeding this limit. Essentially all new > > combining characters are either class 0 or fall into one of the > > 200-range positional classes. > > Or 9, for viramas. Or 1, for overlays. Don't forget them... Or 7, f

Re: Compression through normalization

2003-11-25 Thread Doug Ewell
Philippe Verdy wrote: > I say YES only for compressors that are supposed to work on Unicode > text (this applies to BOCU-1 and SCSU which are not intented to > compress anything else than Unicode text), but NO of course for > general purpose compressors (like deflate in zip files.) Of course. >

RE: Compression through normalization

2003-11-25 Thread Philippe Verdy
Mark Davis writes: > I would say that a compressor can normalize, if (a) when decompressing it > produces NFC, and (b) it advertises that it normalizes. Why condition (a) ? NFD could be used as well, and even another normalization where combining characters are sorted differently, or partly recomp

RE: Compression through normalization

2003-11-25 Thread Philippe Verdy
Doug Ewell writes: > * Philippe Verdy and Jill Ramonsky say YES, a compressor can > normalize, because it knows it is operating on Unicode character data > and can take advantage of Unicode properties. I say YES only for compressors that are supposed to work on Unicode text (this applies to BO

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Rick McGowan
Of course, as usual, this is my opinion. UTC hasn't actually made any proclamations about what will or won't be done in terms of the classes or what kinds of classes might be assigned in the future. Rick > John Cowan suggested... > > > We will never come close to exceeding this limit.

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Peter Kirk
On 25/11/2003 08:55, Doug Ewell wrote: Normalization may or may not have an effect on compression. It has definitely been shown to have an effect on Hebrew combining marks. I must ask, however, that we try to keep these issues separate in discussion, and not let the compression topic, if there is

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Rick McGowan
John Cowan suggested... > We will never come close to exceeding this limit. Essentially all new > combining characters are either class 0 or fall into one of the 200-range > positional classes. Or 9, for viramas. One take-home point is that there won't be any more "fixed position" classes add

Re: Compression through normalization

2003-11-25 Thread Mark Davis
To: "Unicode Mailing List" <[EMAIL PROTECTED]>; "UnicoRe Mailing List" <[EMAIL PROTECTED]> Sent: Tue, 2003 Nov 25 11:08 Subject: Re: Compression through normalization > Here's a summary of the responses so far: > > * Philippe Verdy and J

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Mark Davis
Sent: Tue, 2003 Nov 25 11:18 Subject: Re: Normalisation stability, was: Compression through normalization > Philippe Verdy wrote: > > > I'm not convinced that there's a significant improvement when only > > checking for normalization but not performing it. It req

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Peter Kirk
On 25/11/2003 11:15, John Cowan wrote: Peter Kirk scripsit: If receivers are expected to check for normalisation, they are presumably expected also to normalise Not so. An alternative behavior, which is preferred in certain circumstances, is to reject the input, or at least to advise h

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread John Cowan
Peter Kirk scripsit: > If receivers are expected to check for normalisation, they are > presumably expected also to normalise Not so. An alternative behavior, which is preferred in certain circumstances, is to reject the input, or at least to advise higher layers that the input may be invalid.

Re: Compression through normalization

2003-11-25 Thread Doug Ewell
Here's a summary of the responses so far: * Philippe Verdy and Jill Ramonsky say YES, a compressor can normalize, because it knows it is operating on Unicode character data and can take advantage of Unicode properties. * Peter Kirk and Mark Shoulson say NO, it can't, because all the compresso

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Doug Ewell
Philippe Verdy wrote: > I'm not convinced that there's a significant improvement when only > checking for normalization but not performing it. It requires at least > a list of the characters that are acceptable in a normalization form, as > well as their combining classes. UAX #15 begs to differ. S
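
The "check, don't normalize" shortcut Doug is pointing at is what the UAX #15 quick-check properties support; in current Python the check is exposed directly (unicodedata.is_normalized, available since Python 3.8), so the cheap path looks like this sketch:

    import unicodedata

    s = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

    # Cheap check first; only pay for normalization when it is actually needed.
    if not unicodedata.is_normalized("NFC", s):
        s = unicodedata.normalize("NFC", s)

    assert s == "\u00E9"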

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Peter Kirk
On 25/11/2003 10:03, John Cowan wrote: ... And as for canonical equivalence, the most efficient way to compare strings for it is to normalize both of them in some way and then do a raw binary compare. Since it adds efficiency to normalize only once, it is worthwhile to define a few normalization

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Doug Ewell
Peter Kirk wrote: > Well, Doug, I see your point; different topics should be kept > separate. But I changed the subject line precisely because the thread > has shifted from discussion of compression to a general discussion of > normalisation stability. That's true; most people would probably not

RE: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Philippe Verdy
John Cowan writes: > Since it adds efficiency to normalize only once, > it is worthwhile to define a few normalization forms and urge > people to produce text in one of them, so that receivers need not > normalize but need only check for normalization, typically much cheaper. I'm not convinced tha

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread John Cowan
Philippe Verdy scripsit: > I just wonder however why it was "crucial" (as Unicode says in its > Definitions chapter) to expect a relative order of distinct non-zero > combining classes. For me these combining classes are arbitrary not only on > their absolute value as they are now, but even their

RE: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Philippe Verdy
From: Peter Kirk [mailto:[EMAIL PROTECTED] > Sent: Tuesday, 25 November 2003 17:06 > To: [EMAIL PROTECTED] > Cc: [EMAIL PROTECTED] > Subject: Re: Normalisation stability, was: Compression through > normalization > > > On 25/11/2003 07:22, Philippe Verdy wrote: >

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Doug Ewell
Normalization may or may not have an effect on compression. It has definitely been shown to have an effect on Hebrew combining marks. I must ask, however, that we try to keep these issues separate in discussion, and not let the compression topic, if there is to be any, degenerate into a wing of t

Re: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Peter Kirk
On 25/11/2003 07:22, Philippe Verdy wrote: ... Composition exclusions have a lower impact as well as the relative orders of canonical classes, as they don't affect canonical equivalence of strings, and thus won't affect applications based on the Unicode C10 definition; they are important only to

RE: Normalisation stability, was: Compression through normalization

2003-11-25 Thread Philippe Verdy
> >So it's the absence of stability which would make impossible this > >rearrangement of normalization forms... > > Canonical equivalence is unaffected if combining classes are rearranged, > though not if they are split or joined. It is only the normalised forms > of strings which may be changed

Normalisation stability, was: Compression through normalization

2003-11-25 Thread Peter Kirk
On 24/11/2003 16:56, Philippe Verdy wrote: Peter Kirk writes: If conformance clause C10 is taken to be operable at all levels, this makes a nonsense of the concept of normalisation stability within databases etc. I don't think that the stability of normalization influence this: as long a

RE: Compression through normalization

2003-11-25 Thread Arcane Jill
I'm pretty sure it depends on whether you regard a text document as a sequence of characters, or as a sequence of glyphs. (Er - I mean "default grapheme clusters" of course). Regarded as a sequence of characters, normalisation changes that sequence. But regarded as a sequence of glyphs, normali

RE: Compression through normalization

2003-11-24 Thread Philippe Verdy
Peter Kirk writes: > If conformance clause C10 is taken to be operable at all levels, this > makes a nonsense of the concept of normalisation stability within > databases etc. I don't think that the stability of normalization influence this: as long as there's a guarantee of being able to restor

Re: Compression through normalization

2003-11-24 Thread Peter Kirk
On 24/11/2003 07:52, Mark E. Shoulson wrote: On 11/24/03 01:26, Doug Ewell wrote: So the question becomes: Is it legitimate for a Unicode compression engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into another (canonically equivalent) normalization form to improve its compres

Re: Compression through normalization

2003-11-24 Thread Mark E. Shoulson
On 11/24/03 01:26, Doug Ewell wrote: So the question becomes: Is it legitimate for a Unicode compression engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into another (canonically equivalent) normalization form to improve its compressibility? OK, this *is* a fascinating question

Compression through normalization (was: Re: Ternary search trees)

2003-11-24 Thread Doug Ewell
with its NFC equivalent and still claim "not to modify" its interpretation; there is no "loss of data" in the sense of a bitmap being converted to JPEG. Yet there is no bit-for-bit equivalence either; for a given text T, there is no promise that: decompress(compress(T)) = T If
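
A minimal sketch of the distinction Doug is drawing (Python; zlib stands in for an arbitrary compressor, and the normalization step is the hypothetical "compression through normalization" being debated): the round trip yields text that is canonically equivalent to the input but not bit-for-bit identical to it.

    import unicodedata, zlib

    def compress(text: str) -> bytes:
        # Replace the text with a canonically equivalent NFC form before deflating.
        return zlib.compress(unicodedata.normalize("NFC", text).encode("utf-8"))

    def decompress(blob: bytes) -> str:
        return zlib.decompress(blob).decode("utf-8")

    t = "e\u0301"                       # decomposed: e + combining acute
    out = decompress(compress(t))

    print(out == t)                     # False: not the same code point sequence
    print(unicodedata.normalize("NFD", out)
          == unicodedata.normalize("NFD", t))   # True: canonically equivalent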