RE: Compression through normalization

2003-12-06 Thread Philippe Verdy
Mark Davis writes: > > OK, then I suppose I should play devil's advocate and ask Peter's and > > Philippe's question again: If C10 only restricts the modifications to > > "canonically equivalent sequences," why should there be an additional > > restriction that further limits them to NFC or NFD?

Re: Compression through normalization

2003-12-06 Thread Mark Davis
com - Original Message - From: "Doug Ewell" <[EMAIL PROTECTED]> To: "Unicode Mailing List" <[EMAIL PROTECTED]> Cc: "Kenneth Whistler" <[EMAIL PROTECTED]> Sent: Fri, 2003 Dec 05 23:38 Subject: Re: Compression through norm

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Peter Kirk wrote: >> Subprocesses within a closed system may be able to make certain >> assumptions for efficiency. Process B, for example, may know that >> its only source of input is Process A, which is guaranteed always to >> produce NFC. ... > > Does C9 actually allow this? Well, perhaps wit

Re: Compression through normalization

2003-12-06 Thread Peter Kirk
On 06/12/2003 09:49, Doug Ewell wrote: ... But as C10 does not mandate any normalized form (just canonical equivalence of the results), I don't think that it requires that a compressor should produce its result in either NFC or NFD form. Right. I know that. But Mark and Ken said it should,

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Philippe Verdy wrote: > First C10 only restricts modifications just to preserve all the > semantics of the encoded text in any context. There are situations > where this restriction does not apply: when performing text > transformations (such as folding, or even substringing, which may or > may n

Re: Compression through normalization

2003-12-06 Thread Peter Kirk
On 06/12/2003 03:48, Philippe Verdy wrote: ... But as C10 does not mandate any normalized form (just canonical equivalence of the results), I don't think that it requires that a compressor should produce its result in either NFC or NFD form. Instead I think that it's up to the next process to dete

RE: Compression through normalization

2003-12-06 Thread Philippe Verdy
Doug Ewell > OK, then I suppose I should play devil's advocate and ask Peter's and > Philippe's question again: If C10 only restricts the modifications to > "canonically equivalent sequences," why should there be an additional > restriction that further limits them to NFC or NFD? Or, put another

Re: Compression through normalization

2003-12-06 Thread Jungshik Shin
On Fri, 5 Dec 2003, Doug Ewell wrote: > Philippe Verdy wrote: > > > Still on the same subject, how do the old KSX standards for Han[g]ul > > compare with each other? If they are upward compatible, and specify that > > the conversion from an old text not using compound letters to the new > > In

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Kenneth Whistler wrote: > I don't think either of our recommendations here are specific > to compression issues. They're not, but compression is what I'm focusing on right now, and your recommendations do *apply* to compression. > Basically, if a process tinkers around with changing sequences >

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Mark Davis wrote: > Think you are missing a negative, see below. > >> Compression techniques may optionally replace certain sequences with >> canonically equivalent sequences to improve efficiency, but *only* if >> the output of the decompressed text is expected to be > is not required to be >> c

Re: Compression through normalization

2003-12-06 Thread Doug Ewell
Philippe Verdy wrote: > Still on the same subject, how do the old KSX standards for Han[g]ul > compare with each other? If they are upward compatible, and specify that > the conversion from an old text not using compound letters to the new > standard does not mandate their composition into compound ja

Re: Compression through normalization

2003-12-05 Thread Mark Davis
Well, in my dialect of English, 'ken' and 'can' are nearly indistinguishable, and there are many "can's" in Unicode; probably more than "mark's". I'm reminded of what a farmer is supposed to have once said about his produce: "We eat what we can, and what we can't, we can." > P.S. On the other hand,

Re: Compression through normalization

2003-12-05 Thread Peter Kirk
On 05/12/2003 14:01, Philippe Verdy wrote: ... It's just a shame that what was considered as equivalent in the Korean standards is considered as canonically distinct (and even compatibility distinct) in Unicode. This means that the same exact abstract Korean text can have two distinct representat

Re: Compression through normalization

2003-12-05 Thread Michael Everson
At 13:13 -0800 2003-12-05, Kenneth Whistler wrote: On the other hand, if you asked him nicely, Mark might find the more marked form, NFD, to his liking, especially since it is likely to contain more combining marks. Mark is definitely in favor of markedness. I, on the other hand, am definitely

RE: Compression through normalization

2003-12-05 Thread Philippe Verdy
Mark Davis writes: > Doug Ewell writes: > > OK. So it's Mark, not me, who is unilaterally extending C10. > > Where on earth do you get that? I did say that, in practice, NFC should be > produced, but that is simply a practical guideline, independent of C10. I also think that the NFC form is not r

Re: Compression through normalization

2003-12-05 Thread Kenneth Whistler
Doug asked: > Mark indicated that a compression-decompression cycle should not only > stick to canonical-equivalent sequences, which is what C10 requires, but > should convert text only to NFC (if at all). Ken mentioned > normalization "to forms NFC or NFD," but I'm not sure this was in the > sam

Re: Compression through normalization

2003-12-05 Thread Peter Kirk
On 05/12/2003 10:03, Mark Davis wrote: OK. So it's Mark, not me, who is unilaterally extending C10. Where on earth do you get that? I did say that, in practice, NFC should be produced, but that is simply a practical guideline, independent of C10. Mark Well, of course "unilaterally extendi

Re: Compression through normalization

2003-12-05 Thread Mark Davis
t;Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]> Sent: Fri, 2003 Dec 05 08:43 Subject: Re: Compression through normalization > Kenneth Whistler wrote: > > > Canonical equivalence is about not modifying the interpretation of the > > text. That

Re: Compression through normalization

2003-12-05 Thread Mark Davis
to.com - Original Message - From: "Peter Kirk" <[EMAIL PROTECTED]> To: "Doug Ewell" <[EMAIL PROTECTED]> Cc: "Unicode Mailing List" <[EMAIL PROTECTED]> Sent: Fri, 2003 Dec 05 02:51 Subject: Re: Compression through normalization > On

Re: Compression through normalization

2003-12-05 Thread Doug Ewell
Kenneth Whistler wrote: > Canonical equivalence is about not modifying the interpretation of the > text. That is different from considerations about not changing the > text, period. > > If some process using text is sensitive to *any* change in the text > whatsoever (CRC-checking or any form of di
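
Ken's distinction can be illustrated with a small sketch (assuming Python's unicodedata and zlib.crc32, which are not part of the thread): normalization preserves canonical equivalence, but it changes any digest computed over the raw code units.

    import unicodedata, zlib

    text = "cafe\u0301"                           # 'e' + combining acute
    nfc = unicodedata.normalize("NFC", text)      # precomposed 'café'

    # The interpretation (canonical equivalence) is preserved...
    print(unicodedata.normalize("NFD", nfc) == unicodedata.normalize("NFD", text))  # True
    # ...but a byte-sensitive check over the encoded text is not.
    print(zlib.crc32(text.encode("utf-8")) == zlib.crc32(nfc.encode("utf-8")))      # False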

Re: Compression through normalization

2003-12-05 Thread Peter Kirk
On 05/12/2003 00:34, Doug Ewell wrote: Peter Kirk wrote: Surely ignoring Composition Exclusions is not unilaterally extending C10. The excluded precomposed characters are still canonically equivalent to the decomposed (and normalised) forms. And so composing a text with them, for compression

Re: Compression through normalization

2003-12-05 Thread Doug Ewell
Peter Kirk wrote: > Surely ignoring Composition Exclusions is not unilaterally extending > C10. The excluded precomposed characters are still canonically > equivalent to the decomposed (and normalised) forms. And so composing > a text with them, for compression or any other purpose, still conform

RE: Compression through normalization

2003-12-04 Thread Philippe Verdy
> If some process using text is sensitive to the *interpretation* of > the text, i.e. it is concerned about the content and meaning of > the letters involved, then normalization, to forms NFC or NFD, > which only involve canonical equivalences, will *not* make a difference. > Or to be more subtle a

Re: Compression through normalization

2003-12-04 Thread Kenneth Whistler
Mark said: > The operations of compression followed by decompression can conformantly produce > any text that is canonically equivalent to the original without purporting to > modify the text. (How the internal compressed format is determined is completely > arbitrary - it could NFD, compress, dec
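
A minimal sketch of the round trip Mark describes, using Python's unicodedata and zlib as a stand-in compressor (both are assumptions, not anything from SCSU or BOCU-1): the decompressed output is canonically equivalent to the input even though the code point sequence differs.

    import unicodedata, zlib

    def compress(text):
        # Normalize to NFD before compressing (the internal format is arbitrary).
        return zlib.compress(unicodedata.normalize("NFD", text).encode("utf-8"))

    def decompress(blob):
        # Produce NFC on the way out.
        return unicodedata.normalize("NFC", zlib.decompress(blob).decode("utf-8"))

    original = "Ha\u0308ndel"                    # decomposed spelling
    restored = decompress(compress(original))
    print(restored == original)                  # False: the code points differ
    print(unicodedata.normalize("NFD", restored) ==
          unicodedata.normalize("NFD", original))  # True: canonically equivalent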

Re: Compression through normalization

2003-12-04 Thread Peter Kirk
On 04/12/2003 08:39, Doug Ewell wrote: ... (2) I am NOT interested in inventing a new normalization form, or any variants on existing forms. Any approach that involves compatibility equivalences, ignores the Composition Exclusions table, or creates equivalences that do not exist in the Unicode

Re: Compression through normalization

2003-12-04 Thread Mark Davis
ument that it does so. Mark __ http://www.macchiato.com - Original Message - From: "Doug Ewell" <[EMAIL PROTECTED]> To: "Unicode Mailing List" <[EMAIL PROTECTED]> Sent: Thu, 2003 Dec 04 0

Re: Compression through normalization

2003-12-04 Thread Doug Ewell
Just to clear up some possible misconceptions that I think may have developed: This thread started when Philippe Verdy mentioned the possibility of converting certain sequences of Unicode characters to a *canonically equivalent sequence* to improve compression. An example was converting Korean te
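
The Korean example turns on the fact that composing conjoining jamo into precomposed syllables is a canonically equivalent rewrite. A small sketch (Python's unicodedata assumed; the jamo string is an arbitrary illustration):

    import unicodedata

    jamo = "\u1112\u1161\u11AB"                    # CHOSEONG HIEUH + JUNGSEONG A + JONGSEONG NIEUN
    composed = unicodedata.normalize("NFC", jamo)  # the precomposed syllable HAN (U+D55C)

    print(composed)                                          # '한'
    print(unicodedata.normalize("NFD", composed) == jamo)    # True: the two spellings are canonically equivalent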

RE: Compression through normalization

2003-12-04 Thread Philippe Verdy
Kent Karlsson wrote: > Philippe Verdy wrote: > > If we count also the encoded modern LV and LVT johab syllables: > > > > ( ((Ls|Lm)+ (Vs|Vm)+) | > > ((Ls|Lm)* (LsVs|LsVm|LmVs|LmVm) (Vs|Vm)*) | > > ((Ls|Lm)* (LsVsTs|LsVmTs|LmVsTs|LmVmTs| > >

RE: Compression through normalization

2003-12-04 Thread Kent Karlsson
Philippe Verdy wrote: ... > letters each. Fortunately, the definition of Hangul syllable blocks need > not be changed, as it works well with Hangul syllables as L+, V+, T* > (where L, V, and T stand for single-letter jamos). In fact the Unicode encoding of modern
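
As a rough sketch of that definition (Python's re module; the code point ranges below are the conjoining-jamo blocks and are my shorthand, not Kent's), a syllable block spelled with single-letter jamos can be matched as L+ V+ T*:

    import re

    L = "[\u1100-\u115F]"   # leading consonants (choseong)
    V = "[\u1160-\u11A7]"   # vowels (jungseong)
    T = "[\u11A8-\u11FF]"   # trailing consonants (jongseong)
    block = re.compile(f"{L}+{V}+{T}*")

    text = "\u1112\u1161\u11AB\u1100\u1173\u11AF"   # two syllables spelled with jamos
    print(len(block.findall(text)))                  # 2, one match per syllable block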

RE: Compression through normalization

2003-12-04 Thread Philippe Verdy
Kent Karlsson writes: > Philippe Verdy wrote: > > > I just have another question for Korean: many jamos are in fact > > composed from other jamos: this is clearly visible both in their name > > and in their composed glyph. What would be the linguistic impact of > > decomposing them (not canoni

RE: Compression through normalization

2003-12-04 Thread Kent Karlsson
Philippe Verdy wrote: > I just have another question for Korean: many jamos are in fact > composed from other jamos: this is clearly visible both in their name > and in their composed glyph. What would be the linguistic impact of > decomposing them (not canonically!)? Do Koreans really learn

Re: Compression through normalization

2003-12-03 Thread John Cowan
Jungshik Shin scripsit: > (e.g. 'enough' in English is logographic in a sense, isn't it?) No, it's just obsolete. Some parts of Hangeul spelling are obsolete too. "&", that's logographic. -- John Cowan [EMAIL PROTECTED] At times of peril or dubitation, h

RE: Compression through normalization

2003-12-03 Thread Philippe Verdy
Jungshik Shin writes: > On Wed, 3 Dec 2003, Philippe Verdy wrote: > > > I just have another question for Korean: many jamos are in fact composed > > from other jamos: this is clearly visible both in their name > and in their > > composed glyph. What would be the linguistic impact of > decomposin

RE: Compression through normalization

2003-12-03 Thread Philippe Verdy
> De : Jungshik Shin [mailto:[EMAIL PROTECTED] >Note that Korean syllables in Unicode are NOT "LVT?" as you > seem to think I did not say that... > BUT "L+V+T*" with '+', '*' and '?' have usual RE meaning. I said this: ( ((L* V* VT T*) - (L* V+ T)) | X )* > Who said that? 11,172 precom

Re: Compression through normalization

2003-12-03 Thread Jungshik Shin
On Wed, 3 Dec 2003, Doug Ewell wrote: > Philippe Verdy wrote: > Speaking of which, I just noticed that the function in SC UniPad to > compose syllables from jamos does not handle this case (LV + T = LVT). > I'll have to report that to the UniPad team. Yudit, Mozilla and soon a whole bunch of
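
The LV + T = LVT case Jungshik mentions is pure arithmetic in the Unicode Hangul composition algorithm; a hypothetical helper (constants from Chapter 3 of the standard, function name my own) might look like this:

    S_BASE, T_BASE, T_COUNT = 0xAC00, 0x11A7, 28

    def compose_lv_t(lv, t):
        # Attach a trailing jamo T to a precomposed LV syllable, giving the LVT syllable.
        s_index = ord(lv) - S_BASE
        t_index = ord(t) - T_BASE
        assert s_index % T_COUNT == 0, "first argument must be an LV syllable"
        assert 0 < t_index < T_COUNT, "second argument must be a trailing jamo"
        return chr(ord(lv) + t_index)

    print(compose_lv_t("\uAC00", "\u11A8"))   # GA + JONGSEONG KIYEOK -> '각' (U+AC01)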

RE: Compression through normalization

2003-12-03 Thread Jungshik Shin
On Thu, 4 Dec 2003, Jungshik Shin wrote: > On Wed, 3 Dec 2003, Philippe Verdy wrote: > > That kind of composition/decomposition is necessary for linguistic > analysis of Korean. Search engines (e.g. google), rendering engines > and incremental searches also need that. See > > http://i18nl10n

RE: Compression through normalization

2003-12-03 Thread Jungshik Shin
On Wed, 3 Dec 2003, Philippe Verdy wrote: > I just have another question for Korean: many jamos are in fact composed > from other jamos: this is clearly visible both in their name and in their > composed glyph. What would be the linguistic impact of decomposing them (not > canonically!)? Do Korean

RE: Compression through normalization

2003-12-03 Thread Jungshik Shin
On Wed, 3 Dec 2003, Philippe Verdy wrote: > Jungshik Shin writes: > > > > I already answered about it: I had mixed the letters TLV instead of > > > > LVT. All the above was correct if you swap the letters. So what I did > > > > really was to compose only VT but not LV nor LVT: > > > > > > > > ( ((

Re: Compression through normalization

2003-12-03 Thread Doug Ewell
Philippe Verdy wrote: > I still think that we could try to use only LV syllables but not LVT > syllables to reduce the set of Hangul character used if this helps > the final compressor. Aha, LV syllables. Now we are talking about something that exists and can be used in the manner you describe.
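
Philippe's "LV but not LVT" idea amounts to splitting each precomposed LVT syllable into its precomposed LV syllable plus a trailing conjoining jamo, which remains canonically equivalent to the original. A hypothetical sketch (constants from the Hangul composition algorithm; the function name is mine):

    S_BASE, T_BASE, T_COUNT, S_COUNT = 0xAC00, 0x11A7, 28, 11172

    def to_lv_plus_t(text):
        out = []
        for ch in text:
            s = ord(ch) - S_BASE
            if 0 <= s < S_COUNT and s % T_COUNT:            # a precomposed LVT syllable
                out.append(chr(S_BASE + s - s % T_COUNT))    # its LV part
                out.append(chr(T_BASE + s % T_COUNT))        # its trailing jamo
            else:
                out.append(ch)
        return "".join(out)

    print(to_lv_plus_t("\uD55C\uAE00"))   # '한글' -> '하' + U+11AB + '그' + U+11AF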

RE: Compression through normalization

2003-12-03 Thread Philippe Verdy
Doug Ewell writes: > I just read C10 again and noticed that it says that character sequences > can be replaced by canonical-equivalent sequences -- NOT that they have > to end up in a particular normalization form. So your strategy of > converting to a form halfway between NFC and NFD seems accept

RE: Compression through normalization

2003-12-03 Thread Philippe Verdy
Jungshik Shin writes: > > > I already answered about it: I had mixed the letters TLV instead of > > > LVT. All the above was correct if you swap the letters. So what I did > > > really was to compose only VT but not LV nor LVT: > > > > > > ( ((L* V* VT T*) - (L* V+ T)) | X )* > > > > > > I did it b

Re: Compression through normalization

2003-12-01 Thread Peter Kirk
On 01/12/2003 04:25, Philippe Verdy wrote: ... And what about a compressor that would identify the source as being Unicode, and would convert it first to NFC, but including composed forms for the compositions normally excluded from NFC? This seems marginal but some languages would have better

RE: Compression through normalization

2003-12-01 Thread jon
Quoting Philippe Verdy <[EMAIL PROTECTED]>: > [EMAIL PROTECTED] wrote: > > Further, a Unicode-aware algorithm would expect a choseong character to > > be followed by a jungseong and a jongseong to follow a jungseong, and > > could essentially provide the same benefits to compression that > > nor

RE: Compression through normalization

2003-12-01 Thread Philippe Verdy
[EMAIL PROTECTED] wrote: > Further, a Unicode-aware algorithm would expect a choseong character to > be followed by a jungseong and a jongseong to follow a jungseong, and > could essentially provide the same benefits to compression that > normalising to NFC provides but without making an irrevers

Re: Compression through normalization

2003-12-01 Thread jon
Quoting Doug Ewell <[EMAIL PROTECTED]>: > Someone, I forgot who, questioned whether converting Unicode text to NFC > would actually improve its compressibility, and asked if any actual data > was available. I was pretty sure converting to NFC would help compression (at least some of the time), I

Re: Compression through normalization

2003-11-30 Thread Doug Ewell
Jungshik Shin wrote: > I finally downloaded the file and took a look at it. I was surprised > to find that the text is the entire content of volume 1 of a > famous Korean novel (Arirang) by a _living_ Korean writer CHO Chongrae > (published in the early 1990s). This seems to be problematic b

Re: Compression through normalization

2003-11-30 Thread Jungshik Shin
On Sat, 29 Nov 2003, Doug Ewell wrote: > A longer and more realistic case can be seen in the sample Korean file > at: > > http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt I finally downloaded the file and took a look at it. I was surprised to find that the text is the entir

Re: Compression through normalization

2003-11-29 Thread Doug Ewell
Someone, I forgot who, questioned whether converting Unicode text to NFC would actually improve its compressibility, and asked if any actual data was available. Certainly there is no guarantee that normalization would *always* result in a smaller file. A compressor that took advantage of normaliz
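
One crude way to gather that data (a sketch only, assuming Python's zlib as the general-purpose compressor and unicodedata for normalization; the sample string is arbitrary): compare deflate output sizes for the NFC and NFD forms of the same text.

    import unicodedata, zlib

    def compressed_size(text, form):
        return len(zlib.compress(unicodedata.normalize(form, text).encode("utf-8")))

    sample = "\u1112\u1161\u11AB\u1100\u1173\u11AF " * 1000   # jamo-spelled Korean, repeated
    print(compressed_size(sample, "NFC"), compressed_size(sample, "NFD"))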

RE: Compression through normalization

2003-11-27 Thread Philippe Verdy
Doug Ewell writes: > Peter Kirk wrote: > > > Yes, the compressor can make any canonically equivalent change, not > > just composing composition exclusions but reordering combining marks > > in different classes. The only flaw I see is that the compressor does > > not have to undo these changes on

Re: Compression through normalization

2003-11-27 Thread Doug Ewell
Peter Kirk wrote: > Yes, the compressor can make any canonically equivalent change, not > just composing composition exclusions but reordering combining marks > in different classes. The only flaw I see is that the compressor does > not have to undo these changes on decompression; at least no oth

RE: Compression through normalization

2003-11-26 Thread D. Starner
> Use Base64 - it is stable through all normalisation forms. The problem with Base64 (and worse yet, PUA characters for bytes) is that it's inefficient. Base64 offers 6 bits per 8 (75%) on UTF-8, 6 bits per 16 (37.5%) on UTF-16. You can get 15 bits per 16 (93.75%) on UTF-16 and 15 bits per 24 (62.5%) on
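
The "15 bits per 16" figure comes from packing the payload 15 bits at a time into single BMP code units. A toy sketch of just the arithmetic (the offset 0x5000 is an arbitrary choice that avoids surrogates; a real scheme would also have to dodge noncharacters, unassigned code points, and normalization-sensitive ranges):

    BASE = 0x5000   # 0x5000..0xCFFF stays clear of the surrogate range

    def pack15(data):
        bits = int.from_bytes(data, "big")
        out = []
        for shift in range(0, len(data) * 8, 15):
            out.append(chr(BASE + ((bits >> shift) & 0x7FFF)))
        return "".join(out)

    print(len(pack15(bytes(30))))   # 240 payload bits -> 16 code units (Base64 would need 40 characters)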

RE: Compression through normalization

2003-11-26 Thread jon
> The whole point of such a tool would be to send binary data on a transport > that > only allowed Unicode text. In practice, you'd also have to remap C0 and C1 > characters; but even then 0x00-0x1F -> U+0250-026F and 0x80-0x9F to > U+0270-U+028F > wouldn't be too complex. Unless you've added a Uni
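
The remapping in the quoted text is a straight offset; a minimal sketch (the function name and the pass-through treatment of other bytes are my own assumptions):

    def remap_byte(b):
        if b < 0x20:                      # C0 controls -> U+0250..U+026F
            return chr(0x0250 + b)
        if 0x80 <= b <= 0x9F:             # C1 controls -> U+0270..U+028F
            return chr(0x0270 + b - 0x80)
        return chr(b)                     # everything else passes through

    print(hex(ord(remap_byte(0x0A))))     # 0x25a: a line feed becomes U+025A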

Re: Compression through normalization

2003-11-26 Thread Peter Kirk
On 26/11/2003 07:05, D. Starner wrote: ... The whole point of such a tool would be to send binary data on a transport that only allowed Unicode text. In practice, you'd also have to remap C0 and C1 characters; but even then 0x00-0x1F -> U+0250-026F and 0x80-0x9F to U+0270-U+028F wouldn't be too c

RE: Compression through normalization

2003-11-26 Thread D. Starner
> I see no reason why you accept some limitations for this > encapsulation, but not ALL the limitations. Because I can convert the data from binary to Unicode text in UTF-16 in a few lines of code if I don't worry about normalization. Suddenly the rules become much more complex if I have to worry

RE: Compression through normalization

2003-11-26 Thread Philippe Verdy
D. Starner writes: > > In the case of GIF versus JPG, which are usually regarded as "lossless" > > versus "lossy", please note that there /is/ no "original", in the sense > > of a stream of bytes. Why not? Because an image is not a stream of > > bytes. Period. > > GIF isn't a compression scheme

RE: Compression through normalization

2003-11-26 Thread Philippe Verdy
Peter Kirk [peterkirk at qaya dot org] writes: > On 25/11/2003 16:38, Doug Ewell wrote: > > >Philippe Verdy wrote: > > > >>So SCSU and BOCU-* formats are NOT general purpose compressors. As > >>they are defined only in terms of stream of Unicode code points, they > >>are assumed to follow the co

RE: Compression through normalization

2003-11-26 Thread D. Starner
> In the case of GIF versus JPG, which are usually regarded as "lossless" > versus "lossy", please note that there /is/ no "original", in the sense > of a stream of bytes. Why not? Because an image is not a stream of > bytes. Period. GIF isn't a compression scheme; it uses the LZW compression s

RE: Compression through normalization

2003-11-26 Thread jon
> In the case of GIF versus JPG, which are usually regarded as "lossless" > versus "lossy", please note that there /is/ no "original", in the sense > of a stream of bytes. Why not? Because an image is not a stream of > bytes. Period. What is being compressed here is a rectangular array of > pixe

Re: Compression through normalization

2003-11-26 Thread Peter Kirk
On 25/11/2003 16:38, Doug Ewell wrote: Philippe Verdy wrote: So SCSU and BOCU-* formats are NOT general purpose compressors. As they are defined only in terms of stream of Unicode code points, they are assumed to follow the conformance clauses of Unicode. As they recognize their input as Unic

RE: Compression through normalization

2003-11-26 Thread Arcane Jill
iginal Message- > From: Doug Ewell [mailto:[EMAIL PROTECTED]] > Sent: Tuesday, November 25, 2003 7:09 PM > To: Unicode Mailing List; UnicoRe Mailing List > Subject: Re: Compression through normalization > > > Here's a summary of the responses so far: > > * Phi

Re: Compression through normalization

2003-11-25 Thread Doug Ewell
Philippe Verdy wrote: > So SCSU and BOCU-* formats are NOT general purpose compressors. As > they are defined only in terms of stream of Unicode code points, they > are assumed to follow the conformance clauses of Unicode. As they > recognize their input as Unicode text, they can recognize canoni

RE: Compression through normalization

2003-11-25 Thread Philippe Verdy
Doug Ewell writes: > Yes, you can take SCSU- or BOCU-1-encoded text and recompress it using a > GP compression scheme. Atkin and Stansifer's paper from last year is > all about that, and I spend a few pages on it in my paper as well. You > can also re-Zip a Zip file, though, so I don't know what

Re: Compression through normalization

2003-11-25 Thread Doug Ewell
Philippe Verdy wrote: > I say YES only for compressors that are supposed to work on Unicode > text (this applies to BOCU-1 and SCSU which are not intented to > compress anything else than Unicode text), but NO of course for > general purpose compressors (like deflate in zip files.) Of course. >

RE: Compression through normalization

2003-11-25 Thread Philippe Verdy
Mark Davis writes: > I would say that a compressor can normalize, if (a) when decompressing it > produces NFC, and (b) it advertises that it normalizes. Why condition (a)? NFD could be used as well, and even another normalization where combining characters are sorted differently, or partly recomp

RE: Compression through normalization

2003-11-25 Thread Philippe Verdy
Doug Ewell writes: > * Philippe Verdy and Jill Ramonsky say YES, a compressor can > normalize, because it knows it is operating on Unicode character data > and can take advantage of Unicode properties. I say YES only for compressors that are supposed to work on Unicode text (this applies to BO

Re: Compression through normalization

2003-11-25 Thread Mark Davis
ROTECTED]> To: "Unicode Mailing List" <[EMAIL PROTECTED]>; "UnicoRe Mailing List" <[EMAIL PROTECTED]> Sent: Tue, 2003 Nov 25 11:08 Subject: Re: Compression through normalization > Here's a summary of the responses so far: > > * Philippe Verdy and J

Re: Compression through normalization

2003-11-25 Thread Doug Ewell
Here's a summary of the responses so far: * Philippe Verdy and Jill Ramonsky say YES, a compressor can normalize, because it knows it is operating on Unicode character data and can take advantage of Unicode properties. * Peter Kirk and Mark Shoulson say NO, it can't, because all the compresso

RE: Compression through normalization

2003-11-25 Thread Arcane Jill
I'm pretty sure it depends on whether you regard a text document as a sequence of characters, or as a sequence of glyphs. (Er - I mean "default grapheme clusters" of course). Regarded as a sequence of characters, normalisation changes that sequence. But regarded as a sequence of glyphs, normali

RE: Compression through normalization

2003-11-24 Thread Philippe Verdy
Peter Kirk writes: > If conformance clause C10 is taken to be operable at all levels, this > makes a nonsense of the concept of normalisation stability within > databases etc. I don't think that the stability of normalization influences this: as long as there's a guarantee of being able to restor

Re: Compression through normalization

2003-11-24 Thread Peter Kirk
On 24/11/2003 07:52, Mark E. Shoulson wrote: On 11/24/03 01:26, Doug Ewell wrote: So the question becomes: Is it legitimate for a Unicode compression engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into another (canonically equivalent) normalization form to improve its compres

Re: Compression through normalization

2003-11-24 Thread Mark E. Shoulson
On 11/24/03 01:26, Doug Ewell wrote: So the question becomes: Is it legitimate for a Unicode compression engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into another (canonically equivalent) normalization form to improve its compressibility? OK, this *is* a fascinating question