At 01:07 PM 3/3/03 -0800, Mark Davis wrote:
If your converter purports to produce any one of the Unicode encoding forms,
then it cannot conformantly produce malformed Unicode as a result.
If, of course, it does not purport to do that, it can do anything it wants
to.
Then, as long as the documentat
PROTECTED]>; "'Michael (michka) Kaplan'" <[EMAIL PROTECTED]>
Cc: "'Yung-Fong Tang'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, March 03, 2003 12:17
Subject: Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for
review)
>; "Kent Karlsson"
<[EMAIL PROTECTED]>; "'Michael (michka) Kaplan'" <[EMAIL PROTECTED]>
Cc: "'Yung-Fong Tang'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Monday, March 03, 2003 11:21
Subject: Re: UTF-8 Error Handling (was: Re:
At 11:52 AM 3/3/03 -0800, Mark Davis wrote:
Perhaps I wasn't clear; I agree with you on that.
1) It is conformant to skip or substitute text, with just a code at the end
indicating that something of that sort was done.
It's a subtle point, but can be put into your formulation:
What I was after is
"Mark Davis" <[EMAIL PROTECTED]>; "Kent Karlsson"
<[EMAIL PROTECTED]>; "'Michael (michka) Kaplan'" <[EMAIL PROTECTED]>
Cc: "'Yung-Fong Tang'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Sunday, March 02,
ng'" <[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Sunday, March 02, 2003 21:10
Subject: Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for
review)
> At 07:21 AM 3/2/03 -0800, Mark Davis wrote:
> > >"C12a When a process interprets a code
I am not sure yet how far I want to get into this discussion... but this seems worth mentioning:
Asmus Freytag wrote:
The ideal case is one where the converter stops in a restartable
configuration, allowing the client to implement (or ask for) a variety
of error-recovery options.
A nice descript
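A restartable converter of the kind described above might look like the following Python sketch. It is not any poster's actual API: `decode_restartable` and the `on_error` callback are hypothetical names, and the code relies only on the `start` and `end` attributes that Python's `UnicodeDecodeError` exposes.

```python
def decode_restartable(data: bytes, on_error) -> str:
    """Decode UTF-8, stopping at each ill-formed sequence so that the
    client-supplied on_error(bad_bytes) callback decides what happens."""
    out = []
    pos = 0
    while pos < len(data):
        try:
            out.append(data[pos:].decode("utf-8"))
            break  # the rest of the input was well-formed
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets into the slice data[pos:].
            out.append(data[pos:pos + err.start].decode("utf-8"))
            bad = data[pos + err.start:pos + err.end]
            out.append(on_error(bad))  # client chooses the recovery
            pos += err.end             # restart just past the bad bytes
    return "".join(out)


def strip(bad: bytes) -> str:
    return ""


def substitute(bad: bytes) -> str:
    return "\ufffd"
```

Passing `strip` drops the errant bytes, passing `substitute` leaves U+FFFD in their place, and a callback that raises an exception aborts the conversion; the converter itself never interprets the ill-formed bytes as characters.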
At 07:21 AM 3/2/03 -0800, Mark Davis wrote:
>"C12a When a process interprets a code unit sequence which
> purports to be in a Unicode character encoding form, it
> shall treat ill-formed code unit sequences as an error
> condition, and shall not interpret such sequences as
> cha
From: "Mark Davis" <[EMAIL PROTECTED]>
> I agree with Kent that it is somewhat less robust to simply remove
> ill-formed sequences, since it removes any indication that the data was
> corrupted.
Nice that the API gives one the option to choose, huh? ;-)
The notion of continuing (even if one is
<[EMAIL PROTECTED]>; <[EMAIL PROTECTED]>
Sent: Sunday, March 02, 2003 02:00
Subject: RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for
review)
>
>
> Michael (michka) Kaplan:
> ...
> > then the conversion will simply strip the errant characters. No
Michael (michka) Kaplan:
...
> then the conversion will simply strip the errant characters. Note that
> either solution meets the needs of refusal to interpret the errant
> sequences.
Simply stripping the errant byte sequences means that they are
each interpreted as the empty string of character
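The difference between stripping and substituting can be seen with Python's built-in UTF-8 decoder, used here only as a convenient illustration; the variable names are made up for the example.

```python
data = b"abc\xffdef"  # 0xFF can never appear in well-formed UTF-8

stripped = data.decode("utf-8", errors="ignore")
replaced = data.decode("utf-8", errors="replace")

# Stripping leaves no trace: the result is identical to what
# uncorrupted input would have produced.
assert stripped == b"abcdef".decode("utf-8")

# Substituting keeps a visible marker (U+FFFD) at the point of the error.
assert replaced == "abc\ufffddef"

# The default ("strict") refuses to continue at all.
try:
    data.decode("utf-8")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True
assert strict_raised
```

All three behaviours avoid interpreting the errant byte as a character; they differ only in how much evidence of the corruption survives.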
From: "Yung-Fong Tang" <[EMAIL PROTECTED]>
> When you deal with encodings which need state (ISO-2022, ISO-2022-JP,
> etc.) or variable-length encodings (Shift_JIS, Big5, UTF-8), then the
> situation is different.
Unicode cannot of course speak for those other encodings, but it can
speak for UTF-8.
Kenneth Whistler wrote:
Think of it this way. Does anyone expect the ASCII standard to tell,
in detail, what a process should or should not do if it receives
data which purports to be ASCII, but which contains an 0x80 byte
in it? All the ASCII standard can really do is tell you that
0x80 is not
Doug Ewell wrote:
Yung-Fong Tang wrote:
So... in the future, in order to ensure we have a good software
environment, we not only need to make the Unicode 4.0 clear, but also
need to speed up the revision of those RFCs.
But the Unicode Consortium and UTC have no contr
Thanks for letting me know. I guess I haven't spent enough time on www.unicode.org
these days :) When did you add those PDFs there? It used to have only partial
sections available... but that was probably several years ago.
Roozbeh Pournader wrote:
On Thu, 27 Feb 2003, Mark Davis wrote:
> The Unicode Standard *is* free of charge; the entire text is posted on
> www.unicode.org.
Well, free of charge to *read personally on the screen*, of course. You
can't print the major versions yourself, Addison-Wesley must be asked for
that ;) And you can'
Stefan Persson suggested:
> >Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
> >sequences. There were two types:
> >
> > a. 0xC0 0x80 for U+0000 (instead of 0x00)
> > b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80)
> >
> >
> Ah, but encoding NUL
Kenneth Whistler wrote:
> Yes, it is true. All the standard *mandates* is what I quoted
> previously in this thread:
>
> "C12a When a process interprets a code unit sequence which purports
> to be in a Unicode character encoding form, it shall treat
> ill-formed code unit sequences as
Yung-Fong Tang wrote:
> So... in the future, in order to ensure we have a good software
> environment, we not only need to make the Unicode 4.0 clear, but also
> need to speed up the revision of those RFCs.
But the Unicode Consortium and UTC have no control over that. And as
you can see, Franço
age -
From: "Yung-Fong Tang" <[EMAIL PROTECTED]>
To: "Kenneth Whistler" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Thursday, February 27, 2003 13:06
Subject: Re: Unicode 4.0 BETA available for review
> >
> >
> >
> >I can keep answ
Tex Texin asked:
> Hmm, is that true?
Yes, it is true. All the standard *mandates* is what I quoted
previously in this thread:
"C12a When a process interprets a code unit sequence which purports
to be in a Unicode character encoding form, it shall treat
ill-formed code unit sequence
Ken,
Hmm, is that true? Is it ok then, if I detect an unpaired surrogate, mutter
"oops I have an error" and then drop that surrogate and continue processing
the file, resulting in a valid utf-8 file?
I thought for some reason this was prohibited, but if the standard does not
prescribe error handli
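The "mutter, drop, and continue" approach asked about here can be sketched in Python as follows. This is only an illustration of the idea, not a statement of what the standard requires; `utf16_to_utf8` and its `on_unpaired` parameter are hypothetical names, and the input is assumed to be a list of 16-bit code units.

```python
def utf16_to_utf8(units, on_unpaired="drop"):
    """Convert UTF-16 code units to UTF-8 bytes.

    Unpaired surrogates are never encoded. They are either dropped or
    replaced by U+FFFD, and the number of errors seen is returned so
    the caller knows that something was skipped or substituted.
    """
    out = []
    errors = 0
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed surrogate pair: combine into one code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Unpaired surrogate: note the error, emit nothing ill-formed.
            errors += 1
            if on_unpaired == "replace":
                out.append("\ufffd")
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out).encode("utf-8"), errors
```

Either way the output is well-formed UTF-8; the returned error count plays the role of the "code at the end" that tells the caller the input was not clean.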
[EMAIL PROTECTED] wrote:
> 1. the definition in Unicode itself (3.0, 3.1)
> 2. the RFC which summarizes it.
>
> I am sure you can control point 1. But we have to understand that
> point 2 is also important. The reason people refer to point 2 is
> usually that the RFC is much shorter and focus th
I can keep answering these questions, but I can also assure
everyone that the UTC worked *very* hard this time around to
make the character encoding model much clearer in the Unicode 4.0
text, and to anticipate all these edge cases.
--Ken
The problem in the past come from two (or more place
Frank Tang responded to Kent Karlsson's response:
> The problem I need to deal with is not GENERATING that UTF-8, but how to
> handle the DATA when my code receives it. For example, when I receive a
> 10K UTF-8 file which has 1000 lines of text, if there is one UTF-8
> sequence in the line 99
Frank Tang asked:
> >> This discussion has been centered around UTF-8. But I hope the
> >>corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
> >>
> >>. for UTF-32: occurrences of 'surrogates' are ill-formed.
> >>
> >>
> >>
> How about UTF-32 sequence which the 4 bytes represent
Kent Karlsson wrote:
The Unicode 4.0 text further strengthens Conformance Clause
C12, to make this crystal clear:
"C12 When a process generates a code unit sequence which
purports to be in a Unicode character encoding form, it shall
not emit ill-formed code unit sequences.
"C12a Wh
This discussion has been centered around UTF-8. But I hope the
corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0:
. for UTF-32: occurrences of 'surrogates' are ill-formed.
How about a UTF-32 sequence in which the 4 bytes represent a value > U+10FFFF?
Are they considered ill-formed?
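For what it is worth, the UTF-32 rule under discussion fits in a couple of lines. The Python sketch below assumes the Unicode 4.0 definition, under which a UTF-32 code unit is well-formed only if it is a Unicode scalar value (0..0x10FFFF, excluding the surrogate range); the function names are hypothetical.

```python
def is_well_formed_utf32_unit(value: int) -> bool:
    """True if a 32-bit code unit is a Unicode scalar value."""
    in_range = 0 <= value <= 0x10FFFF
    is_surrogate = 0xD800 <= value <= 0xDFFF
    return in_range and not is_surrogate


def is_well_formed_utf32(units) -> bool:
    """True if every code unit in the sequence is well-formed."""
    return all(is_well_formed_utf32_unit(u) for u in units)
```

Under that assumption both cases raised in this message are ill-formed: a surrogate such as 0xD800, and any value above 0x10FFFF such as 0x110000.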
Stefan Persson wrote:
Kenneth Whistler wrote:
Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
sequences. There were two types:
a. 0xC0 0x80 for U+0000 (instead of 0x00)
b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90
0x80 0x80)
Ah, but encoding NULL as
Kenneth Whistler wrote:
Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value
sequences. There were two types:
a. 0xC0 0x80 for U+0000 (instead of 0x00)
b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80)
Ah, but encoding NULL as a surrogate character and
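The two byte sequences quoted above can be reproduced with a short Python sketch. It uses Python's `surrogatepass` error handler purely to build the six-byte form for comparison; the variable names are made up for the example.

```python
# Shortest (well-formed) encodings of the two code points.
nul_utf8 = "\x00".encode("utf-8")
u10000_utf8 = "\U00010000".encode("utf-8")

# Split U+10000 into its UTF-16 surrogate pair.
offset = 0x10000 - 0x10000
high = 0xD800 + (offset >> 10)     # 0xD800
low = 0xDC00 + (offset & 0x3FF)    # 0xDC00

# Encoding each surrogate separately as a 3-byte sequence gives the
# six-byte form; "surrogatepass" is needed because the default
# encoder refuses to do this.
six_byte = (chr(high).encode("utf-8", "surrogatepass")
            + chr(low).encode("utf-8", "surrogatepass"))

assert nul_utf8 == b"\x00"
assert u10000_utf8 == b"\xf0\x90\x80\x80"
assert six_byte == b"\xed\xa0\x80\xed\xb0\x80"


def rejected(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder refuses the bytes."""
    try:
        data.decode("utf-8")
        return False
    except UnicodeDecodeError:
        return True


# A strict decoder rejects both forms.
assert rejected(b"\xc0\x80")
assert rejected(six_byte)
```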
Yung-Fong Tang wrote:
I see a hole here. How about UTF-8 representing a pair of surrogate
code points with two 3-octet sequences instead of a single UTF-8
sequence? It should be ill-formed since it is also non-shortest form,
right? But we really need to watch out for the language used there so
Frank Tang continued:
> >If you read through those definitions from Unicode 4.0 carefully,
> >you will see that UTF-8 representing a noncharacter is perfectly
> >valid, but UTF-8 representing an unpaired surrogate code point
> >is ill-formed (and therefore disallowed).
> >
> I see a hole here. How
Kenneth Whistler wrote:
If you read through those definitions from Unicode 4.0 carefully,
you will see that UTF-8 representing a noncharacter is perfectly
valid, but UTF-8 representing an unpaired surrogate code point
is ill-formed (and therefore disallowed).
I see a hole here. How about UTF-
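The distinction drawn here between noncharacters and surrogates can be checked against Python's own UTF-8 codec, which is used below only as one example of a strict implementation; the helper name is hypothetical.

```python
def decodes_cleanly(data: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# U+FFFE is a noncharacter, but its UTF-8 form is still well-formed.
noncharacter = "\ufffe".encode("utf-8")
assert noncharacter == b"\xef\xbf\xbe"
assert decodes_cleanly(noncharacter)

# U+D800 is a surrogate code point: the encoder refuses to emit it...
try:
    chr(0xD800).encode("utf-8")
    encoder_refused = False
except UnicodeEncodeError:
    encoder_refused = True
assert encoder_refused

# ...and the decoder refuses the bytes that would represent it.
assert not decodes_cleanly(b"\xed\xa0\x80")
```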
Frank Tang asked:
> so the UTF-8 sequences which represent U+FFFE, U+FFFF and U+{1-11}FFF{E,F}
> are considered legal in Unicode 4.0
Yes. Such sequences are also legal in Unicode 3.0, 3.1, and 3.2.
The Unicode Standard, Version 3.0 specified, on p. 46:
"To ensure that round-trip transcoding is pos
so the UTF-8 sequences which represent U+FFFE, U+FFFF and U+{1-11}FFF{E,F}
are considered legal in Unicode 4.0
Kenneth Whistler wrote:
Frank Tang asked:
I am working on updating the Mozilla UTF-8 code to incorporate the change
of UTF-8 definition in Unicode 3.1 (make non-shortest fo
Frank Tang asked:
> I am working on updating the Mozilla UTF-8 code to incorporate the change
> of UTF-8 definition in Unicode 3.1 (make non-shortest form illegal,
> and make 5-6 octets illegal) and Unicode 3.2 (make irregular form
> illegal) now. I wonder whether there is any change of the UTF-8 defini
I am working on updating the Mozilla UTF-8 code to incorporate the change
of UTF-8 definition in Unicode 3.1 (make non-shortest form illegal,
and make 5-6 octets illegal) and Unicode 3.2 (make irregular form
illegal) now. I wonder whether there is any change of the UTF-8 definition from
Unicode 3.2 to
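The three restrictions listed in this message (no non-shortest forms, no 5- or 6-octet sequences, no surrogates) can all be enforced by checking lead bytes and the range of the first trail byte. The Python sketch below follows the well-formed byte ranges as I understand them from the Unicode Standard; it is not the Mozilla code, and the function names are hypothetical.

```python
def utf8_sequence_length(data: bytes, i: int) -> int:
    """Length of the well-formed UTF-8 sequence starting at data[i],
    or 0 if the bytes there are ill-formed."""
    b0 = data[i]
    if b0 <= 0x7F:
        return 1
    # (number of trail bytes, allowed range of the first trail byte)
    if 0xC2 <= b0 <= 0xDF:
        need, lo, hi = 1, 0x80, 0xBF
    elif b0 == 0xE0:
        need, lo, hi = 2, 0xA0, 0xBF  # excludes non-shortest forms
    elif b0 == 0xED:
        need, lo, hi = 2, 0x80, 0x9F  # excludes surrogates
    elif 0xE1 <= b0 <= 0xEF:
        need, lo, hi = 2, 0x80, 0xBF
    elif b0 == 0xF0:
        need, lo, hi = 3, 0x90, 0xBF  # excludes non-shortest forms
    elif 0xF1 <= b0 <= 0xF3:
        need, lo, hi = 3, 0x80, 0xBF
    elif b0 == 0xF4:
        need, lo, hi = 3, 0x80, 0x8F  # excludes values above U+10FFFF
    else:
        # 0x80..0xC1 and 0xF5..0xFF (including old 5-6 octet lead
        # bytes) never start a well-formed sequence.
        return 0
    if i + need >= len(data):
        return 0  # truncated sequence
    if not lo <= data[i + 1] <= hi:
        return 0
    for k in range(2, need + 1):
        if not 0x80 <= data[i + k] <= 0xBF:
            return 0
    return need + 1


def is_well_formed_utf8(data: bytes) -> bool:
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data, i)
        if n == 0:
            return False
        i += n
    return True
```

With these ranges, `b"\xc0\x80"`, `b"\xed\xa0\x80"` and a 5-octet sequence such as `b"\xf8\x88\x80\x80\x80"` are all rejected, while the noncharacter U+FFFE (`b"\xef\xbf\xbe"`) is accepted.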