Yes, but the specification allows for 6-byte sequences, or 32-bit characters. As Dennis pointed out, just because they're not used doesn't mean we should not allow them to be stored, since there might be someone using the high ranges for a private character set, which could very well be included in the specification some day.
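To make that concrete, here is a rough sketch of what accepting the full 1..6 byte range might look like, assuming the original UTF-8 layout (up to 31 bits of payload). The function names utf8_seq_len and utf8_decode are made up for illustration and are not the ones in the backend. It rejects start bytes with 7 or 8 leading ones and does not check for overlong encodings:

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Length of a UTF-8 sequence judged from its start byte, using the
 * original 1..6 byte layout. Returns 0 for bytes that cannot start a
 * sequence: continuation bytes (10xxxxxx) and start bytes with 7 or 8
 * leading ones (0xFE, 0xFF).
 * Hypothetical helper, for illustration only.
 */
static int
utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;  /* 0xxxxxxx */
    if (lead < 0xC0) return 0;  /* 10xxxxxx: continuation byte */
    if (lead < 0xE0) return 2;  /* 110xxxxx */
    if (lead < 0xF0) return 3;  /* 1110xxxx */
    if (lead < 0xF8) return 4;  /* 11110xxx */
    if (lead < 0xFC) return 5;  /* 111110xx */
    if (lead < 0xFE) return 6;  /* 1111110x */
    return 0;                   /* 0xFE, 0xFF: never legal */
}

/*
 * Decode one sequence from s (avail bytes readable) into a 32-bit
 * value. Returns the number of bytes consumed, or 0 if the sequence is
 * truncated or malformed. Overlong forms are NOT rejected here.
 */
static int
utf8_decode(const unsigned char *s, size_t avail, uint32_t *out)
{
    int      len, i;
    uint32_t cp;

    if (avail == 0)
        return 0;
    len = utf8_seq_len(s[0]);
    if (len == 0 || (size_t) len > avail)
        return 0;
    if (len == 1)
    {
        *out = s[0];
        return 1;
    }

    /* Payload bits of the start byte: 5, 4, 3, 2 or 1 for len 2..6. */
    cp = s[0] & (0x7F >> len);
    for (i = 1; i < len; i++)
    {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        cp = (cp << 6) | (s[i] & 0x3F);
    }
    *out = cp;
    return len;
}
```

With this layout the largest storable value is 0x7FFFFFFF (1 bit from the start byte plus 5 x 6 continuation bits), so a 32-bit internal character type is wide enough to hold anything a 6-byte sequence can carry.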
Regards,

John Hansen

-----Original Message-----
From: Tatsuo Ishii [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 8:09 PM
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]; John Hansen; [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: Re: [PATCHES] [HACKERS] UNICODE characters above 0x10000

> Dennis Bjorklund <[EMAIL PROTECTED]> writes:
> > ... This also means that the start byte can never start with 7 or 8
> > ones, that is illegal and should be tested for and rejected. So the
> > longest utf-8 sequence is 6 bytes (and the longest character needs 4
> > bytes (or 31 bits)).
>
> Tatsuo would know more about this than me, but it looks from here like
> our coding was originally designed to support only 16-bit-wide
> internal characters (ie, 16-bit pg_wchar datatype width). I believe
> that the regex library limitation here is gone, and that as far as
> that library is concerned we could assume a 32-bit internal character
> width. The question at hand is whether we can support 32-bit
> characters or not --- and if not, what's the next bug to fix?

pg_wchar is already a 32-bit datatype. However, I doubt there's
actually a need for 32-bit-wide character sets. Even Unicode only
uses up to 0x0010FFFF, so 24 bits should be enough...
--
Tatsuo Ishii