Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Asmus Freytag
At 01:07 PM 3/3/03 -0800, Mark Davis wrote: If your converter purports to produce any one of the Unicode encoding forms, then it cannot conformantly produce malformed Unicode as a result. If, of course, it does not purport to do that, it can do anything it wants to. Then, as long as the documentat

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Mark Davis

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Mark Davis

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Asmus Freytag
At 11:52 AM 3/3/03 -0800, Mark Davis wrote: Perhaps I wasn't clear; I agree with you on that. 1) It is conformant to skip or substitute text, with just a code at the end indicating that something of that sort was done. It's a subtle point, but can be put into your formulation: What I was after is

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Asmus Freytag

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Mark Davis
> At 07:21 AM 3/2/03 -0800, Mark Davis wrote: > > >"C12a When a process interprets a code

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-03 Thread Markus Scherer
I am not sure yet how far I want to get into this discussion... but this seems worth mentioning: Asmus Freytag wrote: The ideal case is one where the converter stops in a restartable configuration, allowing the client to implement (or ask for) a variety of error-recovery options. A nice descript
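
A minimal sketch of the "restartable" idea described above, assuming Python's built-in UTF-8 decoder as the converter. The names `decode_restartable` and `on_error` are hypothetical and do not come from the thread or from any library. The converter stops at each ill-formed sequence, reports its position, and lets the client decide how to recover before it resumes.

```python
from typing import Callable


def decode_restartable(data: bytes, on_error: Callable[[int, bytes], str]) -> str:
    """Decode UTF-8, pausing at each ill-formed sequence.

    on_error(position, bad_bytes) is a hypothetical client callback; whatever
    string it returns is spliced into the output in place of the bad bytes.
    """
    out = []
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding of everything not yet consumed.
            out.append(data[pos:].decode("utf-8"))
            break
        except UnicodeDecodeError as err:
            # err.start and err.end are offsets within the slice data[pos:].
            out.append(data[pos:pos + err.start].decode("utf-8"))
            bad = data[pos + err.start:pos + err.end]
            out.append(on_error(pos + err.start, bad))
            # Restart just past the ill-formed bytes.
            pos += err.end
    return "".join(out)


# The client picks the recovery policy: substitute U+FFFD, or drop the bytes.
substituted = decode_restartable(b"ab\xc0\x80cd", lambda pos, bad: "\ufffd")
dropped = decode_restartable(b"ab\xc0\x80cd", lambda pos, bad: "")
```

Because the policy lives in the callback, the same converter can serve a client that wants to abort, one that wants to substitute, and one that wants to log and skip.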

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Asmus Freytag
At 07:21 AM 3/2/03 -0800, Mark Davis wrote: >"C12a When a process interprets a code unit sequence which > purports to be in a Unicode character encoding form, it > shall treat ill-formed code unit sequences as an error > condition, and shall not interpret such sequences as > cha

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Michael (michka) Kaplan
From: "Mark Davis" <[EMAIL PROTECTED]> > I agree with Kent that it is somewhat less robust to simply remove > ill-formed sequences, since it removes any indication that the data was > corrupted. Nice that the API gives one the option to choose, huh? ;-) The notion of continuing (even if one is

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Mark Davis
> Michael (michka) Kaplan: > ... > > then the conversion will simply strip the errant characters. No

RE: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-03-02 Thread Kent Karlsson
Michael (michka) Kaplan: ... > then the conversion will simply strip the errant characters. Note that > either solution meets the needs of refusal to interpret the errant > sequences. Simply stripping the errant byte sequences means that they are each interpreted as the empty string of character
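
To make the options under discussion concrete, here is a small sketch using Python's standard `bytes.decode` error handlers. This is an illustration only, not the API the posters were discussing: `strict` treats the ill-formed sequence as an error, `ignore` strips it silently, and `replace` substitutes U+FFFD so that the corruption stays visible. The helper name `try_strict` is made up for this example.

```python
from typing import Optional

data = b"ab\xc0\x80cd"  # 0xC0 0x80 is an ill-formed (non-shortest) UTF-8 sequence


def try_strict(raw: bytes) -> Optional[str]:
    """Return the decoded text, or None if the input is ill-formed."""
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None


strict_result = try_strict(data)                    # error condition is reported
stripped = data.decode("utf-8", errors="ignore")    # bad bytes vanish silently
replaced = data.decode("utf-8", errors="replace")   # bad bytes become U+FFFD
```

With `ignore`, nothing in the output shows that the input was damaged; with `replace`, the substitution character leaves a trace of it.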

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-28 Thread Michael (michka) Kaplan
From: "Yung-Fong Tang" <[EMAIL PROTECTED]> > When you deal with encodings which need state (ISO-2022, ISO-2022-JP, > etc.) or variable-length encodings (Shift_JIS, Big5, UTF-8), then the > situation is different. Unicode cannot of course speak for those other encodings, but it can speak for UTF-8.

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-28 Thread Yung-Fong Tang
Kenneth Whistler wrote: Think of it this way. Does anyone expect the ASCII standard to tell, in detail, what a process should or should not do if it receives data which purports to be ASCII, but which contains an 0x80 byte in it? All the ASCII standard can really do is tell you that 0x80 is not

Re: Unicode 4.0 BETA available for review

2003-02-28 Thread Yung-Fong Tang
Doug Ewell wrote: Yung-Fong Tang wrote: So... in the future, in order to ensure we have a good software environment, we not only need to make the Unicode 4.0 clear, but also need to speed up the revision of those RFCs. But the Unicode Consortium and UTC have no contr

Re: Unicode 4.0 BETA available for review

2003-02-28 Thread Yung-Fong Tang
Thanks for letting me know. I guess I didn't spend enough time with www.unicode.org these days :) When did you add those PDFs there? It used to have only partial session available... but that is probably a story from several years ago. Roozbeh Pournader wrote: On Thu, 27 Feb 2003, Mark Davis wrote:

Re: Unicode 4.0 BETA available for review

2003-02-28 Thread Roozbeh Pournader
On Thu, 27 Feb 2003, Mark Davis wrote: > The Unicode Standard *is* free of charge; the entire text is posted on > www.unicode.org. Well, free of charge to *read personally on the screen*, of course. You can't print the major versions yourself, Addison-Wesley must be asked for that ;) And you can'

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Kenneth Whistler
Stefan Persson suggested: > >Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value > >sequences. There were two types: > > > > a. 0xC0 0x80 for U+0000 (instead of 0x00) > > b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80) > > > > > Ah, but encoding NUL

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Tex Texin
Kenneth Whistler wrote: > Yes, it is true. All the standard *mandates* is what I quoted > previously in this thread: > > "C12a When a process interprets a code unit sequence which purports > to be in a Unicode character encoding form, it shall treat > ill-formed code unit sequences as

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Doug Ewell
Yung-Fong Tang wrote: > So... in the future, in order to ensure we have a good software > environment, we not only need to make the Unicode 4.0 clear, but also > need to speed up the revision of those RFCs. But the Unicode Consortium and UTC have no control over that. And as you can see, Franço

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Mark Davis
> >I can keep answ

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Kenneth Whistler
Tex Texin asked: > Hmm, is that true? Yes, it is true. All the standard *mandates* is what I quoted previously in this thread: "C12a When a process interprets a code unit sequence which purports to be in a Unicode character encoding form, it shall treat ill-formed code unit sequence

Re: UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Tex Texin
Ken, Hmm, is that true? Is it ok then, if I detect an unpaired surrogate, mutter "oops I have an error" and then drop that surrogate and continue processing the file, resulting in a valid utf-8 file? I thought for some reason this was prohibited, but if the standard does not prescribe error handli

RFC2279bis (was RE: Unicode 4.0 BETA available for review)

2003-02-27 Thread Francois Yergeau
[EMAIL PROTECTED] wrote: > 1. the definition in Unicode itself (3.0, 3.1) > 2. the RFC which summarizes it. > > I am sure you can control point 1. But we have to understand that > point 2 is also important. The reason people refer to point 2 is > usually the RFC is much shorter and focus th

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Yung-Fong Tang
I can keep answering these questions, but I can also assure everyone that the UTC worked *very* hard this time around to make the character encoding model much clearer in the Unicode 4.0 text, and to anticipate all these edge cases. --Ken The problem in the past come from two (or more place

UTF-8 Error Handling (was: Re: Unicode 4.0 BETA available for review)

2003-02-27 Thread Kenneth Whistler
Frank Tang responded to Kent Karlsson's response: > The problem I need to deal with is not how to GENERATE that UTF-8, but how to > handle the DATA when my code receives it. For example, when I receive a > 10K UTF-8 file which has 1000 lines of text, if there is one UTF-8 > sequence in line 99

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Kenneth Whistler
Frank Tang asked: > >> This discussion has been centered around UTF-8. But I hope the > >>corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0: > >> > >>. for UTF-32: occurrences of 'surrogates' are ill-formed. > >> > >> > >> > How about UTF-32 sequence which the 4 bytes represent

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Yung-Fong Tang
Kent Karlsson wrote: The Unicode 4.0 text further strengthens Conformance Clause C12, to make this crystal clear: "C12 When a process generates a code unit sequence which purports to be in a Unicode character encoding form, it shall not emit ill-formed code unit sequences. "C12a Wh

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Yung-Fong Tang
This discussion has been centered around UTF-8. But I hope the corresponding rules apply to UTF-16 and UTF-32 for Unicode 4.0: . for UTF-32: occurrences of 'surrogates' are ill-formed. How about a UTF-32 sequence in which the 4 bytes represent a value > U+10FFFF? Are they considered ill-formed?
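
The two UTF-32 conditions raised above can be sketched as a range check on each 32-bit code unit. This is a hedged illustration rather than text from the standard: it assumes that surrogate values (0xD800..0xDFFF) and values above 0x10FFFF are both ill-formed, and the function names are invented for the example.

```python
from typing import Optional


def is_well_formed_utf32_unit(value: int) -> bool:
    """True if a 32-bit code unit is a Unicode scalar value."""
    if 0xD800 <= value <= 0xDFFF:
        return False  # surrogate code points are not allowed in UTF-32
    return 0 <= value <= 0x10FFFF


def first_ill_formed_utf32(raw: bytes, byteorder: str = "big") -> Optional[int]:
    """Return the byte offset of the first ill-formed unit, or None if all are fine.

    A trailing chunk shorter than 4 bytes is also reported as ill-formed.
    """
    for offset in range(0, len(raw), 4):
        chunk = raw[offset:offset + 4]
        if len(chunk) < 4:
            return offset
        if not is_well_formed_utf32_unit(int.from_bytes(chunk, byteorder)):
            return offset
    return None
```

Noncharacters such as U+FFFE pass this check on purpose; only surrogates and out-of-range values are rejected.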

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Yung-Fong Tang
Stefan Persson wrote: Kenneth Whistler wrote: Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value sequences. There were two types: a. 0xC0 0x80 for U+0000 (instead of 0x00) b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80) Ah, but encoding NULL as

Re: Unicode 4.0 BETA available for review

2003-02-27 Thread Stefan Persson
Kenneth Whistler wrote: Unicode 3.0 defined non-shortest UTF-8 as *irregular* code value sequences. There were two types: a. 0xC0 0x80 for U+0000 (instead of 0x00) b. 0xED 0xA0 0x80 0xED 0xB0 0x80 for U+10000 (instead of 0xF0 0x90 0x80 0x80) Ah, but encoding NULL as a surrogate character and

Re: Unicode 4.0 BETA available for review

2003-02-26 Thread Markus Scherer
Yung-Fong Tang wrote: I see a hole here. How about UTF-8 representing a pair of surrogate code points with two 3-octet sequences instead of a single 4-octet UTF-8 sequence? It should be ill-formed since it is non-shortest form also, right? But we really need to watch out the language used there so

Re: Unicode 4.0 BETA available for review

2003-02-26 Thread Kenneth Whistler
Frank Tang continued: > >If you read through those definitions from Unicode 4.0 carefully, > >you will see that UTF-8 representing a noncharacter is perfectly > >valid, but UTF-8 representing an unpaired surrogate code point > >is ill-formed (and therefore disallowed). > > > I see a hole here. How

Re: Unicode 4.0 BETA available for review

2003-02-26 Thread Yung-Fong Tang
Kenneth Whistler wrote: If you read through those definitions from Unicode 4.0 carefully, you will see that UTF-8 representing a noncharacter is perfectly valid, but UTF-8 representing an unpaired surrogate code point is ill-formed (and therefore disallowed). I see a hole here. How about UTF-

Re: Unicode 4.0 BETA available for review

2003-02-25 Thread Kenneth Whistler
Frank Tang asked: > so the UTF-8 sequences which represent U+FFFE, U+FFFF and U+{1-11}FFF{E,F} > are considered legal in Unicode 4.0 Yes. Such sequences are also legal in Unicode 3.0, 3.1, and 3.2. The Unicode Standard, Version 3.0 specified, on p. 46: "To ensure that round-trip transcoding is pos

Re: Unicode 4.0 BETA available for review

2003-02-25 Thread Yung-Fong Tang
so the UTF-8 sequences which represent U+FFFE, U+FFFF and U+{1-11}FFF{E,F} are considered legal in Unicode 4.0 Kenneth Whistler wrote: Frank Tang asked: I am working on updating the Mozilla UTF-8 code to incorporate the changes to the UTF-8 definition in Unicode 3.1 (make non-shortest fo

Re: Unicode 4.0 BETA available for review

2003-02-24 Thread Kenneth Whistler
Frank Tang asked: > I am working on updating the Mozilla UTF-8 code to incorporate the changes > to the UTF-8 definition in Unicode 3.1 (make non-shortest form illegal, > and make 5-6 octets illegal) and Unicode 3.2 (make irregular form > illegal) now. I wonder, is there any change to the UTF-8 defini

Re: Unicode 4.0 BETA available for review

2003-02-24 Thread Yung-Fong Tang
I am working on updating the Mozilla UTF-8 code to incorporate the changes to the UTF-8 definition in Unicode 3.1 (make non-shortest form illegal, and make 5-6 octets illegal) and Unicode 3.2 (make irregular form illegal) now. I wonder, is there any change to the UTF-8 definition from Unicode 3.2 to
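
The three categories named above (non-shortest forms, 5-6 octet sequences, and irregular surrogate-pair sequences) can be probed with a strict decoder. The sketch below uses Python 3's built-in UTF-8 codec as a stand-in, on the assumption that it applies these same restrictions. It is not the Mozilla code discussed in the thread, and `is_well_formed_utf8` is a hypothetical helper name.

```python
def is_well_formed_utf8(raw: bytes) -> bool:
    """True if Python's strict UTF-8 decoder accepts the byte sequence."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


samples = {
    "non-shortest form of U+0000": b"\xc0\x80",
    "5-octet sequence": b"\xf8\x88\x80\x80\x80",
    "irregular form of U+10000": b"\xed\xa0\x80\xed\xb0\x80",
    "regular form of U+10000": b"\xf0\x90\x80\x80",
    "noncharacter U+FFFE": b"\xef\xbf\xbe",
}

# Map each label to whether the strict decoder accepts it.
results = {label: is_well_formed_utf8(raw) for label, raw in samples.items()}
```

The noncharacter sample is included because, as noted elsewhere in the thread, UTF-8 representing a noncharacter remains valid, while the first three samples are rejected.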