Mark Davis writes:
> > OK, then I suppose I should play devil's advocate and ask Peter's and
> > Philippe's question again: If C10 only restricts the modifications to
> > "canonically equivalent sequences," why should there be an additional
> > restriction that further limits them to NFC or NFD?
- Original Message -
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Cc: "Kenneth Whistler" <[EMAIL PROTECTED]>
Sent: Fri, 2003 Dec 05 23:38
Subject: Re: Compression through normalization
Peter Kirk wrote:
>> Subprocesses within a closed system may be able to make certain
>> assumptions for efficiency. Process B, for example, may know that
>> its only source of input is Process A, which is guaranteed always to
>> produce NFC. ...
>
> Does C9 actually allow this? Well, perhaps wit
On 06/12/2003 09:49, Doug Ewell wrote:
...
But as C10 does not mandate any normalized form (just canonical
equivalence of the results), I don't think that it requires that a
compressor should produce its result in either NFC or NFD form
Right. I know that. But Mark and Ken said it should,
Philippe Verdy wrote:
> First C10 only restricts modifications just to preserve all the
> semantics of the encoded text in any context. There are situations
> where this restriction does not apply: when performing text
> transformations (such as folding, or even substringing, which may or
> may n
On 06/12/2003 03:48, Philippe Verdy wrote:
...
But as C10 does not mandate any normalized form (just canonical equivalence
of the results), I don't think that it requires that a compressor should
produce its result in either NFC or NFD form
Instead I think that it's up to the next process to dete
Doug Ewell
> OK, then I suppose I should play devil's advocate and ask Peter's and
> Philippe's question again: If C10 only restricts the modifications to
> "canonically equivalent sequences," why should there be an additional
> restriction that further limits them to NFC or NFD? Or, put another
On Fri, 5 Dec 2003, Doug Ewell wrote:
> Philippe Verdy wrote:
>
> > Still on the same subject, how do the old KSX standards for Han[g]ul
> > compare with each other? If they are upward compatible, and specify that
> > the conversion from an old text not using compound letters to the new
> > In
Kenneth Whistler wrote:
> I don't think either of our recommendations here are specific
> to compression issues.
They're not, but compression is what I'm focusing on right now, and your
recommendations do *apply* to compression.
> Basically, if a process tinkers around with changing sequences
>
Mark Davis wrote:
> Think you are missing a negative, see below.
>
>> Compression techniques may optionally replace certain sequences with
>> canonically equivalent sequences to improve efficiency, but *only* if
>> the output of the decompressed text is expected to be
> is not required to be
>> c
Philippe Verdy wrote:
> Still on the same subject, how do the old KSX standards for Han[g]ul
> compare with each other? If they are upward compatible, and specify that
> the conversion from an old text not using compound letters to the new
> standard does not mandate their composition into compound ja
Well, in my dialect of English, 'ken' and 'can' are nearly indistinguishable, and
there are many "can's" in Unicode; probably more than "mark's".
I'm reminded of what a farmer is supposed to have once said about his produce: "We
eat what we can, and what we can't, we can."
> P.S. On the other hand,
On 05/12/2003 14:01, Philippe Verdy wrote:
...
It's just a shame that what was considered as equivalent in the Korean
standards is considered as canonically distinct (and even compatibility
distinct) in Unicode. This means that the same exact abstract Korean text
can have two distinct representat
At 13:13 -0800 2003-12-05, Kenneth Whistler wrote:
On the other hand, if you asked him nicely, Mark might find the more
marked form, NFD, to his liking, especially since it is likely to
contain more combining marks. Mark is definitely in favor of
markedness. I, on the other hand, am definitely
Mark Davis writes:
> Doug Ewell writes:
> > OK. So it's Mark, not me, who is unilaterally extending C10.
>
> Where on earth do you get that? I did say that, in practice, NFC should be
> produced, but that is simply a practical guideline, independent of C10.
I also think that the NFC form is not r
Doug asked:
> Mark indicated that a compression-decompression cycle should not only
> stick to canonical-equivalent sequences, which is what C10 requires, but
> should convert text only to NFC (if at all). Ken mentioned
> normalization "to forms NFC or NFD," but I'm not sure this was in the
> sam
On 05/12/2003 10:03, Mark Davis wrote:
OK. So it's Mark, not me, who is unilaterally extending C10.
Where on earth do you get that? I did say that, in practice, NFC should be
produced, but that is simply a practical guideline, independent of C10.
Mark
Well, of course "unilaterally extendi
"Kenneth Whistler" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Fri, 2003 Dec 05 08:43
Subject: Re: Compression through normalization
> Kenneth Whistler wrote:
>
> > Canonical equivalence is about not modifying the interpretation of the
> > text. That
- Original Message -
From: "Peter Kirk" <[EMAIL PROTECTED]>
To: "Doug Ewell" <[EMAIL PROTECTED]>
Cc: "Unicode Mailing List" <[EMAIL PROTECTED]>
Sent: Fri, 2003 Dec 05 02:51
Subject: Re: Compression through normalization
> On
Kenneth Whistler wrote:
> Canonical equivalence is about not modifying the interpretation of the
> text. That is different from considerations about not changing the
> text, period.
>
> If some process using text is sensitive to *any* change in the text
> whatsoever (CRC-checking or any form of di
On 05/12/2003 00:34, Doug Ewell wrote:
Peter Kirk wrote:
Surely ignoring Composition Exclusions is not unilaterally extending
C10. The excluded precomposed characters are still canonically
equivalent to the decomposed (and normalised) forms. And so composing
a text with them, for compression
Peter Kirk wrote:
> Surely ignoring Composition Exclusions is not unilaterally extending
> C10. The excluded precomposed characters are still canonically
> equivalent to the decomposed (and normalised) forms. And so composing
> a text with them, for compression or any other purpose, still conform
> If some process using text is sensitive to the *interpretation* of
> the text, i.e. it is concerned about the content and meaning of
> the letters involved, then normalization, to forms NFC or NFD,
> which only involve canonical equivalences, will *not* make a difference.
> Or to be more subtle a
Mark said:
> The operations of compression followed by decompression can conformantly produce
> any text that is canonically equivalent to the original without purporting to
> modify the text. (How the internal compressed format is determined is completely
> arbitrary - it could NFD, compress, dec
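A minimal sketch of the kind of round trip Mark describes (normalize to NFD internally, compress, and emit NFC on decompression), using Python's standard unicodedata and zlib modules purely for illustration; this is not SCSU or BOCU-1, just the shape of the pipeline:

import unicodedata
import zlib

def compress(text: str) -> bytes:
    # The internal form is arbitrary; any canonically equivalent form will do.
    return zlib.compress(unicodedata.normalize("NFD", text).encode("utf-8"))

def decompress(blob: bytes) -> str:
    # Emit NFC; the result is canonically equivalent to the original input.
    return unicodedata.normalize("NFC", zlib.decompress(blob).decode("utf-8"))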
On 04/12/2003 08:39, Doug Ewell wrote:
...
(2) I am NOT interested in inventing a new normalization form, or any
variants on existing forms. Any approach that involves compatibility
equivalences, ignores the Composition Exclusions table, or creates
equivalences that do not exist in the Unicode
Mark
__
http://www.macchiato.com
- Original Message -
From: "Doug Ewell" <[EMAIL PROTECTED]>
To: "Unicode Mailing List" <[EMAIL PROTECTED]>
Sent: Thu, 2003 Dec 04 0
Just to clear up some possible misconceptions that I think may have
developed:
This thread started when Philippe Verdy mentioned the possibility of
converting certain sequences of Unicode characters to a *canonically
equivalent sequence* to improve compression. An example was converting
Korean te
Kent Karlsson wrote:
> Philippe Verdy wrote:
> > If we count also the encoded modern LV and LVT johab syllables:
> >
> > ( ((Ls|Lm)+ (Vs|Vm)+) |
> > ((Ls|Lm)* (LsVs|LsVm|LmVs|LmVm) (Vs|Vm)*) |
> > ((Ls|Lm)* (LsVsTs|LsVmTs|LmVsTs|LmVmTs|
> >
Philippe Verdy wrote:
...
> letters each. Fortunately, the definition of Hangul syllable blocks need
> not be changed, as it works well with Hangul syllables as L+, V+, T*
> (where L, V, and T stand for single-letter jamos).
In fact the Unicode encoding of modern
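As an illustration of the L+ V+ T* description of a syllable block given above (single-letter modern jamos only; precomposed syllables and old or compound jamos are deliberately ignored here), a simple regular-expression sketch:

import re

SYLLABLE_BLOCK = re.compile(
    "[\u1100-\u1112]+"   # L: modern choseong (leading consonants)
    "[\u1161-\u1175]+"   # V: modern jungseong (vowels)
    "[\u11A8-\u11C2]*"   # T: modern jongseong (trailing consonants)
)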
Kent Karlsson writes:
> Philippe Verdy wrote:
>
> > I just have another question for Korean: many jamos are in fact
> > composed from other jamos: this is clearly visible both in their name
> > and in their composed glyph. What would be the linguistic impact of
> > decomposing them (not canoni
Philippe Verdy wrote:
> I just have another question for Korean: many jamos are in fact
> composed from other jamos: this is clearly visible both in their name
> and in their composed glyph. What would be the linguistic impact of
> decomposing them (not canonically!)? Do Korean really learn
Jungshik Shin scripsit:
> (e.g. 'enough' in English is logographic in a sense, isn't it?)
No, it's just obsolete. Some parts of Hangeul spelling are obsolete too.
"&", that's logographic.
--
John Cowan  [EMAIL PROTECTED]
At times of peril or dubitation, h
Jungshik Shin writes:
> On Wed, 3 Dec 2003, Philippe Verdy wrote:
>
> > I just have another question for Korean: many jamos are in fact composed
> > from other jamos: this is clearly visible both in their name and in their
> > composed glyph. What would be the linguistic impact of decomposin
> From: Jungshik Shin [mailto:[EMAIL PROTECTED]
> Note that Korean syllables in Unicode are NOT "LVT?" as you
> seem to think
I did not say that...
> BUT "L+V+T*", with '+', '*' and '?' having their usual RE meaning.
I said this:
( ((L* V* VT T*) - (L* V+ T)) | X )*
> Who said that? 11,172 precom
On Wed, 3 Dec 2003, Doug Ewell wrote:
> Philippe Verdy wrote:
> Speaking of which, I just noticed that the function in SC UniPad to
> compose syllables from jamos does not handle this case (LV + T = LVT).
> I'll have to report that to the UniPad team.
Yudit, Mozilla and soon a whole bunch of
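For reference, the LV + T = LVT case mentioned above is just the standard Hangul composition arithmetic; a rough sketch for modern syllables, with no validation of its inputs:

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
T_COUNT, N_COUNT = 28, 21 * 28  # N_COUNT = V_COUNT * T_COUNT

def compose_lv(l: str, v: str) -> str:
    # L jamo + V jamo -> precomposed LV syllable
    return chr(S_BASE + (ord(l) - L_BASE) * N_COUNT + (ord(v) - V_BASE) * T_COUNT)

def compose_lvt(lv: str, t: str) -> str:
    # LV syllable + T jamo -> LVT syllable (the case discussed above)
    return chr(ord(lv) + ord(t) - T_BASE)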
On Thu, 4 Dec 2003, Jungshik Shin wrote:
> On Wed, 3 Dec 2003, Philippe Verdy wrote:
>
> That kind of composition/decomposition is necessary for linguistic
> analysis of Korean. Search engines (e.g. google), rendering engines
> and incremental searches also need that. See
>
> http://i18nl10n
On Wed, 3 Dec 2003, Philippe Verdy wrote:
> I just have another question for Korean: many jamos are in fact composed
> from other jamos: this is clearly visible both in their name and in their
> composed glyph. What would be the linguistic impact of decomposing them (not
> canonically!)? Do Korean
Doug Ewell writes:
> > I just have another question for Korean: many jamos are in fact
> > composed from other jamos: this is clearly visible both in their name
> > and in their composed glyph. What would be the linguistic impact of
> > decomposing them (not canonically!)? Do Korean really learn th
On Wed, 3 Dec 2003, Philippe Verdy wrote:
> Jungshik Shin writes:
> > > > I already answered about it: I had mixed the letters TLV instead of
> > > > LVT. All the above was correct if you swap the letters. So what I did
> > > > really was to compose only VT but not LV nor LVT:
> > > >
> > > > ( ((
Philippe Verdy wrote:
> I still think that we could try to use only LV syllables but not LVT
> syllables to reduce the set of Hangul character used if this helps
> the final compressor.
Aha, LV syllables. Now we are talking about something that exists and
can be used in the manner you describe.
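Whether a given precomposed syllable is such an LV syllable can be read directly off its code point; a small sketch:

def is_lv_syllable(ch: str) -> bool:
    # Precomposed syllables occupy U+AC00..U+D7A3; an index that is a multiple
    # of 28 means no trailing consonant, i.e. an LV rather than LVT syllable.
    cp = ord(ch)
    return 0xAC00 <= cp <= 0xD7A3 and (cp - 0xAC00) % 28 == 0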
Doug Ewell writes:
> I just read C10 again and noticed that it says that character sequences
> can be replaced by canonical-equivalent sequences -- NOT that they have
> to end up in a particular normalization form. So your strategy of
> converting to a form halfway between NFC and NFD seems accept
Jungshik Shin writes:
> > > I already answered about it: I had mixed the letters TLV instead of
> > > LVT. All the above was correct if you swap the letters. So what I did
> > > really was to compose only VT but not LV nor LVT:
> > >
> > > ( ((L* V* VT T*) - (L* V+ T)) | X )*
> > >
> > > I did it b
On 01/12/2003 04:25, Philippe Verdy wrote:
...
And what about a compressor that would identify the source as being
Unicode, and would convert it first to NFC, but including composed forms
for the compositions normally excluded from NFC? This seems marginal but
some languages would have better
Quoting Philippe Verdy <[EMAIL PROTECTED]>:
> [EMAIL PROTECTED] wrote:
> > Further, a Unicode-aware algorithm would expect a choseong character to
> > be followed by a jungseong and a jongseong to follow a jungseong, and
> > could essentially perform the same benefits to compression that
> > nor
[EMAIL PROTECTED] wrote:
> Further, a Unicode-aware algorithm would expect a choseong character to
> be followed by a jungseong and a jongseong to follow a jungseong, and
> could essentially perform the same benefits to compression that
> normalising to NFC performs but without making an irrevers
Quoting Doug Ewell <[EMAIL PROTECTED]>:
> Someone, I forgot who, questioned whether converting Unicode text to NFC
> would actually improve its compressibility, and asked if any actual data
> was available.
I was pretty sure converting to NFC would help compression (at least some of
the time), I
Jungshik Shin wrote:
> I finally downloaded the file and took a look at it. I was surprised
> to find that the text is the entire content of the volume 1 of a
> famous Korean novel (Arirang) by a _living_ Korean writer CHO Chongrae
> (published in the early 1990's). This seems to be problematic b
On Sat, 29 Nov 2003, Doug Ewell wrote:
> A longer and more realistic case can be seen in the sample Korean file
> at:
>
> http://www.cs.fit.edu/~ryan/compress/corpora/korean/arirang/arirang.txt
I finally downloaded the file and took a look at it. I was surprised
to find that the text is the entir
Someone, I forgot who, questioned whether converting Unicode text to NFC
would actually improve its compressibility, and asked if any actual data
was available.
Certainly there is no guarantee that normalization would *always* result
in a smaller file. A compressor that took advantage of normaliz
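One way to get actual numbers for a given file and compressor is simply to compare the compressed sizes of the NFC and NFD forms. A rough measurement sketch, with zlib standing in for whatever general-purpose compressor is of interest:

import unicodedata
import zlib

def compressed_sizes(path: str) -> dict:
    text = open(path, encoding="utf-8").read()
    return {
        form: len(zlib.compress(unicodedata.normalize(form, text).encode("utf-8")))
        for form in ("NFC", "NFD")
    }

# e.g. compressed_sizes("arirang.txt") for the Korean sample file mentioned in this thread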
Doug Ewell writes:
> Peter Kirk wrote:
>
> > Yes, the compressor can make any canonically equivalent change, not
> > just composing composition exclusions but reordering combining marks
> > in different classes. The only flaw I see is that the compressor does
> > not have to undo these changes on
Peter Kirk wrote:
> Yes, the compressor can make any canonically equivalent change, not
> just composing composition exclusions but reordering combining marks
> in different classes. The only flaw I see is that the compressor does
> not have to undo these changes on decompression; at least no oth
> Use Base64 - it is stable through all normalisation forms.
The problem with Base64 (and worse yet, PUA characters for bytes), is that
it's inefficient. Base64 offers 6 bits per 8 (75%) on UTF-8, 6 bits per 16 (37%)
on UTF-16. You can get 15 bits per 16 (93%) on UTF-16 and 15 bits per 24 (62%)
on
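The percentages above follow from simple arithmetic: Base64 carries 6 payload bits per character, which costs 8 bits in UTF-8 or 16 in UTF-16, while a 15-bits-per-code-unit scheme carries 15 payload bits in each 16-bit unit. A sketch of the 15-bit packing; BASE below is a purely illustrative 2**15-wide BMP range that avoids surrogates but is NOT normalization-stable (it runs through the Hangul syllables), which is exactly the catch being debated:

BASE = 0x5000  # illustrative only: U+5000..U+CFFF avoids surrogates but is not stable

def pack15(data: bytes) -> str:
    # Pack the byte stream 15 bits at a time, one BMP code point per chunk,
    # giving roughly 15 payload bits per 16-bit UTF-16 code unit.
    bits = int.from_bytes(data, "big")
    nbits = len(data) * 8
    pad = (-nbits) % 15          # pad to a multiple of 15 bits
    bits <<= pad
    nbits += pad
    return "".join(
        chr(BASE + ((bits >> shift) & 0x7FFF))
        for shift in range(nbits - 15, -1, -15)
    )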
> The whole point of such a tool would be to send binary data on a transport
> that
> only allowed Unicode text. In practice, you'd also have to remap C0 and C1
> characters; but even then 0x00-0x1F -> U+0250-026F and 0x80-0x9F to
> U+0270-U+028F
> wouldn't be too complex. Unless you've added a Uni
On 26/11/2003 07:05, D. Starner wrote:
...
The whole point of such a tool would be to send binary data on a transport that
only allowed Unicode text. In practice, you'd also have to remap C0 and C1
characters; but even then 0x00-0x1F -> U+0250-026F and 0x80-0x9F to U+0270-U+028F
wouldn't be too c
> I see no reason why you accept some limitations for this
> encapsulation, but not ALL the limitations.
Because I can convert the data from binary to Unicode text in UTF-16
in a few lines of code if I don't worry about normalization. Suddenly
the rules become much more complex if I have to worry
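The byte-to-character mapping described above really is only a few lines. A sketch of that exact remapping (C0 controls to U+0250..U+026F, C1 controls to U+0270..U+028F, all other bytes kept at their Latin-1 values), deliberately ignoring the normalization questions being discussed:

def bytes_to_text(data: bytes) -> str:
    out = []
    for b in data:
        if b < 0x20:               # C0 controls -> U+0250..U+026F
            out.append(chr(0x0250 + b))
        elif 0x80 <= b <= 0x9F:    # C1 controls -> U+0270..U+028F
            out.append(chr(0x0270 + b - 0x80))
        else:                      # keep as the corresponding Latin-1 character
            out.append(chr(b))
    return "".join(out)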
D. Starner writes:
> > In the case of GIF versus JPG, which are usually regarded as "lossless"
> > versus "lossy", please note that there /is/ no "original", in the sense
> > of a stream of bytes. Why not? Because an image is not a stream of
> > bytes. Period.
>
> GIF isn't a compression scheme
Peter Kirk [peterkirk at qaya dot org] writes:
> On 25/11/2003 16:38, Doug Ewell wrote:
>
> >Philippe Verdy wrote:
> >
> >>So SCSU and BOCU-* formats are NOT general purpose compressors. As
> >>they are defined only in terms of stream of Unicode code points, they
> >>are assumed to follow the co
> In the case of GIF versus JPG, which are usually regarded as "lossless"
> versus "lossy", please note that there /is/ no "original", in the sense
> of a stream of bytes. Why not? Because an image is not a stream of
> bytes. Period.
GIF isn't a compression scheme; it uses the LZW compression s
> In the case of GIF versus JPG, which are usually regarded as "lossless"
> versus "lossy", please note that there /is/ no "original", in the sense
> of a stream of bytes. Why not? Because an image is not a stream of
> bytes. Period. What is being compressed here is a rectangular array of
> pixe
On 25/11/2003 16:38, Doug Ewell wrote:
Philippe Verdy wrote:
So SCSU and BOCU-* formats are NOT general purpose compressors. As
they are defined only in terms of a stream of Unicode code points, they
are assumed to follow the conformance clauses of Unicode. As they
recognize their input as Unic
> -----Original Message-----
> From: Doug Ewell [mailto:[EMAIL PROTECTED]]
> Sent: Tuesday, November 25, 2003 7:09 PM
> To: Unicode Mailing List; UnicoRe Mailing List
> Subject: Re: Compression through normalization
>
>
> Here's a summary of the responses so far:
>
> * Phi
Philippe Verdy wrote:
> So SCSU and BOCU-* formats are NOT general purpose compressors. As
> they are defined only in terms of a stream of Unicode code points, they
> are assumed to follow the conformance clauses of Unicode. As they
> recognize their input as Unicode text, they can recognize canoni
Doug Ewell writes:
> Yes, you can take SCSU- or BOCU-1-encoded text and recompress it using a
> GP compression scheme. Atkin and Stansifer's paper from last year is
> all about that, and I spend a few pages on it in my paper as well. You
> can also re-Zip a Zip file, though, so I don't know what
Rick McGowan writes:
> John Cowan suggested...
> > We will never come close to exceeding this limit. Essentially all new
> > combining characters are either class 0 or fall into one of the
> > 200-range positional classes.
>
> Or 9, for viramas.
Or 1, for overlays. Don't forget them...
Or 7, f
Philippe Verdy wrote:
> I say YES only for compressors that are supposed to work on Unicode
> text (this applies to BOCU-1 and SCSU, which are not intended to
> compress anything other than Unicode text), but NO of course for
> general purpose compressors (like deflate in zip files.)
Of course.
>
Mark Davis writes:
> I would say that a compressor can normalize, if (a) when decompressing it
> produces NFC, and (b) it advertises that it normalizes.
Why condition (a) ? NFD could be used as well, and even another
normalization where combining characters are sorted differently, or partly
recomp
Doug Ewell writes:
> * Philippe Verdy and Jill Ramonsky say YES, a compressor can
> normalize, because it knows it is operating on Unicode character data
> and can take advantage of Unicode properties.
I say YES only for compressors that are supposed to work on Unicode text
(this applies to BO
Of course, as usual, this is my opinion. UTC hasn't actually made any
proclamations about what will or won't be done in terms of the classes or
what kinds of classes might be assigned in the future.
Rick
> John Cowan suggested...
>
> > We will never come close to exceeding this limit.
On 25/11/2003 08:55, Doug Ewell wrote:
Normalization may or may not have an effect on compression. It has
definitely been shown to have an effect on Hebrew combining marks.
I must ask, however, that we try to keep these issues separate in
discussion, and not let the compression topic, if there is
John Cowan suggested...
> We will never come close to exceeding this limit. Essentially all new
> combining characters are either class 0 or fall into one of the 200-range
> positional classes.
Or 9, for viramas.
One take-home point is that there won't be any more "fixed position"
classes add
To: "Unicode Mailing List" <[EMAIL PROTECTED]>; "UnicoRe Mailing List"
<[EMAIL PROTECTED]>
Sent: Tue, 2003 Nov 25 11:08
Subject: Re: Compression through normalization
> Here's a summary of the responses so far:
>
> * Philippe Verdy and J
Sent: Tue, 2003 Nov 25 11:18
Subject: Re: Normalisation stability, was: Compression through normalization
> Philippe Verdy wrote:
>
> > I'm not convinced that there's a significant improvement when only
> > checking for normalization but not performing it. It req
On 25/11/2003 11:15, John Cowan wrote:
Peter Kirk scripsit:
If receivers are expected to check for normalisation, they are
presumably expected also to normalise
Not so. An alternative behavior, which is preferred in certain circumstances,
is to reject the input, or at least to advise higher layers that the input may be invalid.
Peter Kirk scripsit:
> If receivers are expected to check for normalisation, they are
> presumably expected also to normalise
Not so. An alternative behavior, which is preferred in certain circumstances,
is to reject the input, or at least to advise higher layers that the input
may be invalid.
Here's a summary of the responses so far:
* Philippe Verdy and Jill Ramonsky say YES, a compressor can
normalize, because it knows it is operating on Unicode character data
and can take advantage of Unicode properties.
* Peter Kirk and Mark Shoulson say NO, it can't, because all the
compresso
Philippe Verdy wrote:
> I'm not convinced that there's a significant improvement when only
> checking for normalization but not performing it. It requires at least
> a list of the characters that are acceptable in a normalization form,
> as well as their combining classes.
UAX #15 begs to differ. S
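UAX #15's quick-check properties are what make "check but don't normalize" cheap: a scan can answer yes or no for most text and only fall back to full normalization on "maybe". Python 3.8+ exposes such a check directly; a small sketch (older environments would need the quick-check data themselves):

import unicodedata

def ensure_nfc(s: str) -> str:
    # Cheap check first; normalize only when the text is not already NFC.
    if unicodedata.is_normalized("NFC", s):
        return s
    return unicodedata.normalize("NFC", s)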
On 25/11/2003 10:03, John Cowan wrote:
... And as for
canonical equivalence, the most efficient way to compare strings for
it is to normalize both of them in some way and then do a raw
binary compare. Since it adds efficiency to normalize only once,
it is worthwhile to define a few normalization
Peter Kirk wrote:
> Well, Doug, I see your point; different topics should be kept
> separate. But I changed the subject line precisely because the thread
> has shifted from discussion of compression to a general discussion of
> normalisation stability.
That's true; most people would probably not
John Cowan writes:
> Since it adds efficiency to normalize only once,
> it is worthwhile to define a few normalization forms and urge
> people to produce text in one of them, so that receivers need not
> normalize but need only check for normalization, typically much cheaper.
I'm not convinced tha
Philippe Verdy scripsit:
> I just wonder however why it was "crucial" (as Unicode says in its
> Definitions chapter) to expect a relative order of distinct non-zero
> combining classes. For me these combining classes are arbitrary not only on
> their absolute value as they are now, but even their
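The reordering in question is the canonical ordering step: within each run of non-starters, characters are stable-sorted by combining class, so only the relative order of the class numbers affects the result, not their absolute values. A small sketch over already-decomposed text:

import unicodedata

def canonical_order(s: str) -> str:
    # Stable-sort each maximal run of combining marks (ccc > 0) by combining class.
    out, run = [], []
    for ch in s:
        if unicodedata.combining(ch):
            run.append(ch)
        else:
            out.extend(sorted(run, key=unicodedata.combining))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=unicodedata.combining))
    return "".join(out)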
From: Peter Kirk [mailto:[EMAIL PROTECTED]
> Sent: Tuesday, 25 November 2003 17:06
> To: [EMAIL PROTECTED]
> Cc: [EMAIL PROTECTED]
> Subject: Re: Normalisation stability, was: Compression through
> normalization
>
>
> On 25/11/2003 07:22, Philippe Verdy wrote:
>
Normalization may or may not have an effect on compression. It has
definitely been shown to have an effect on Hebrew combining marks.
I must ask, however, that we try to keep these issues separate in
discussion, and not let the compression topic, if there is to be any,
degenerate into a wing of t
On 25/11/2003 07:22, Philippe Verdy wrote:
...
Composition exclusions, as well as the relative order of canonical classes,
have a lower impact, as they don't affect canonical equivalence of strings,
and thus won't affect applications based on the Unicode C10 definition; they
are important only to
> >So it's the absence of stability which would make impossible this
> >rearrangement of normalization forms...
>
> Canonical equivalence is unaffected if combining classes are rearranged,
> though not if they are split or joined. It is only the normalised forms
> of strings which may be changed
On 24/11/2003 16:56, Philippe Verdy wrote:
Peter Kirk writes:
If conformance clause C10 is taken to be operable at all levels, this
makes a nonsense of the concept of normalisation stability within
databases etc.
I don't think that the stability of normalization influences this: as long a
I'm pretty sure it depends on whether you regard a text document as a
sequence of characters, or as a sequence of glyphs. (Er - I mean
"default grapheme clusters" of course). Regarded as a sequence of
characters, normalisation changes that sequence. But regarded as a
sequence of glyphs, normali
Peter Kirk writes:
> If conformance clause C10 is taken to be operable at all levels, this
> makes a nonsense of the concept of normalisation stability within
> databases etc.
I don't think that the stability of normalization influences this: as long as
there's a guarantee of being able to restor
On 24/11/2003 07:52, Mark E. Shoulson wrote:
On 11/24/03 01:26, Doug Ewell wrote:
So the question becomes: Is it legitimate for a Unicode compression
engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
another (canonically equivalent) normalization form to improve its
compres
On 11/24/03 01:26, Doug Ewell wrote:
So the question becomes: Is it legitimate for a Unicode compression
engine -- SCSU, BOCU-1, or other -- to convert text such as Hangul into
another (canonically equivalent) normalization form to improve its
compressibility?
OK, this *is* a fascinating question
with its NFC equivalent and still claim "not to
modify" its interpretation; there is no "loss of data" in the sense of a
bitmap being converted to JPEG. Yet there is no bit-for-bit equivalence
either; for a given text T, there is no promise that:
decompress(compress(T)) = T
If
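On this reading of C10, what the round trip promises is canonical equivalence rather than identity, and that is testable by normalizing both sides to the same form; a sketch:

import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    # Two strings are canonically equivalent iff they are equal after being
    # normalized to the same form (NFD here; NFC would do equally well).
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

# The contract is then canonically_equivalent(decompress(compress(T)), T),
# not decompress(compress(T)) == T.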