RE: UTF-8 validation rules

2001-09-11 Thread Marco Cimarosti
David Hopwood wrote: (about range U+FDD0..U+FDEF) > It's for Arabic presentation forms internal to a rendering implementation, > I assume (although it's not clear why existing private-use characters > couldn't have been used for that). Where could I find more information about this range and

RE: UTF-8 validation rules

2001-09-10 Thread Carl W. Brown
David, > > It's for Arabic presentation forms internal to a rendering implementation, > I assume (although it's not clear why existing private-use characters > couldn't have been used for that). > Now I remember. Thanks, Carl

Re: UTF-8 validation rules

2001-09-10 Thread Kenneth Whistler
David Hopwood said: > > > > With Unicode 3.2 (in the works), the 32 additional code points > > at U+FDD0..U+FDEF go from unallocated status to noncharacters > > as well. > > Those are non-characters in Unicode 3.1 (see D7b in UAX #27). Yes, I stand corrected. They are *already* approved by the

Re: UTF-8 validation rules

2001-09-10 Thread David Starner
On Mon, Sep 10, 2001 at 12:22:20AM +0100, David Hopwood wrote: > It's for Arabic presentation forms internal to a rendering implementation, > I assume (although it's not clear why existing private-use characters > couldn't have been used for that). Because if the implementation uses them, then th

Re: UTF-8 validation rules

2001-09-10 Thread David Hopwood
Kenneth Whistler wrote: > Carl, > > \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters. > > In current parlance (see Unicode 3.1, UAX #27), these are > "noncharacters", and you must account for the fact that > U+1FFFE..U+1FFFF > U+2FFFE..U+2FFFF > ...

Re: UTF-8 validation rules

2001-09-10 Thread Kenneth Whistler
> Also, if you're converting to, say, UTF-16, then non-character sequences > like \xEF\xBF\xBE and \xEF\xBF\xBF should probably be converted to the > corresponding UTF-16 non-characters (\uFFFE and \uFFFF), rather than being > rejected. (Note: Unicode 3.1 and ISO/IEC 10646-1:2000 differ on this p

Re: UTF-8 validation rules

2001-09-10 Thread David Hopwood
"Carl W. Brown" wrote: > I am checking out my UTF-8 validation rules to see if they are correct. > > Check each character to be a valid UTF-8 initial character. > > \x00 to \x7f or \xC2 to \xF4 > > Allow invalid forms su

RE: UTF-8 validation rules

2001-09-10 Thread Carl W. Brown
Ken, > > With Unicode 3.2 (in the works), the 32 additional code points > at U+FDD0..U+FDEF go from unallocated status to noncharacters > as well. > Interesting. I have seen some of the proposed characters but nothing on non-characters. It seems like an interesting range for non-characters. C

RE: UTF-8 validation rules

2001-09-10 Thread Carl W. Brown
Ken, > -Original Message- > From: Kenneth Whistler [mailto:[EMAIL PROTECTED]] > Sent: Monday, September 10, 2001 12:48 PM > To: [EMAIL PROTECTED] > Cc: [EMAIL PROTECTED] > Subject: Re: UTF-8 validation rules > > > Carl, > > > > > \xEF\xB

Re: UTF-8 validation rules

2001-09-10 Thread Kenneth Whistler
Carl, > > \xEF\xBF\xBE and \xEF\xBF\xBF are invalid Unicode characters. In current parlance (see Unicode 3.1, UAX #27), these are "noncharacters", and you must account for the fact that U+1FFFE..U+1FFFF U+2FFFE..U+2FFFF ... U+10FFFE..U+10FFFF all have the same status as noncharacters. With Un
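
For readers who want the noncharacter set from this thread in executable form, here is a minimal sketch in Python. The function name is_noncharacter is hypothetical and is not taken from any poster's code. It assumes the set is the last two code points of every plane plus the range U+FDD0..U+FDEF, as discussed in the messages here.

```python
def is_noncharacter(cp: int) -> bool:
    """Return True if code point cp is a Unicode noncharacter.

    Hypothetical helper, not from the thread. Assumes the set is
    U+FDD0..U+FDEF plus U+nFFFE and U+nFFFF for planes 0 through 16.
    """
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    # The low 16 bits are FFFE or FFFF exactly when masking off the
    # lowest bit leaves FFFE.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE
```

A validator can apply this check after decoding, which keeps the question "is this byte sequence well formed?" separate from "is this code point allowed in interchange?".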

RE: UTF-8 validation rules

2001-09-10 Thread Carl W. Brown
Misha, > You seem to be using the word "character" in some places where > you (probably) mean "byte", eg: > I am getting fuzzy headed these days. Thanks for pointing it out. It should read: > > I am checking out my UTF-8 validation rules to see if t

Re: UTF-8 validation rules

2001-09-10 Thread Misha . Wolf
Carl, You seem to be using the word "character" in some places where you (probably) mean "byte", eg: > All UTF-8 characters must be followed by the proper number of valid > continuation characters, if any. Misha On 10/09/2001 18:21:48 Carl W. Brown wrote: >

UTF-8 validation rules

2001-09-10 Thread Carl W. Brown
I am checking out my UTF-8 validation rules to see if they are correct. Check each character to be a valid UTF-8 initial character. \x00 to \x7f or \xC2 to \xF4 Allow invalid forms such as \xC0 & \xC1 to decode but consider them invalid. A first byte of \xE0 or \xF0 with a second byte
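
The lead-byte rule above (\x00 to \x7f or \xC2 to \xF4) and the extra second-byte checks can be sketched as a byte-level validator in Python. This is an illustration under stated assumptions, not Carl's code: the name is_valid_utf8 is hypothetical, and the second-byte ranges for \xE0, \xED, \xF0 and \xF4 follow the well-formed UTF-8 byte-sequence table in the Unicode Standard, since the message is cut off before it gives its own ranges. The sketch does not reject noncharacters such as \xEF\xBF\xBE.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data is well-formed UTF-8 (hypothetical helper)."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 <= 0x7F:                # single byte, \x00..\x7F
            i += 1
            continue
        if 0xC2 <= b0 <= 0xDF:        # two-byte lead; \xC0 and \xC1 excluded
            need, lo, hi = 1, 0x80, 0xBF
        elif 0xE0 <= b0 <= 0xEF:      # three-byte lead
            need = 2
            lo = 0xA0 if b0 == 0xE0 else 0x80  # \xE0: no overlong forms
            hi = 0x9F if b0 == 0xED else 0xBF  # \xED: no surrogates (assumed rule)
        elif 0xF0 <= b0 <= 0xF4:      # four-byte lead
            need = 3
            lo = 0x90 if b0 == 0xF0 else 0x80  # \xF0: no overlong forms
            hi = 0x8F if b0 == 0xF4 else 0xBF  # \xF4: nothing above U+10FFFF
        else:                         # \x80..\xC1 and \xF5..\xFF cannot lead
            return False
        if i + need >= n:             # sequence is truncated
            return False
        if not lo <= data[i + 1] <= hi:
            return False
        # Any remaining continuation bytes must be \x80..\xBF.
        for j in range(i + 2, i + need + 1):
            if not 0x80 <= data[j] <= 0xBF:
                return False
        i += need + 1
    return True
```

Restricting the second byte by lead byte rejects overlong forms and out-of-range values without decoding the code point first.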