4 actually, 10FFFF needs four bytes:

11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

10FFFF = 00010000 11111111 11111111

Fill in the blanks, starting from the bottom, and you get:

11110100 10001111 10111111 10111111

Regards,

John Hansen

-----Original Message-----
From: Christopher Kings-Lynne [mailto:[EMAIL PROTECTED]
Sent: Saturday, August 07, 2004 8:47 PM
To: Tom Lane
Cc: John Hansen; Hackers; Patches
Subject: Re: [HACKERS] UNICODE characters above 0x10000

> Now it's entirely possible that the underlying support is a few bricks
> shy of a load --- for instance I see that pg_utf_mblen thinks there
> are no UTF8 codes longer than 3 bytes whereas your code goes to 4.
> I'm not an expert on this stuff, so I don't know what the UTF8 spec
> actually says. But I do think you are fixing the code at the wrong level.

Surely there are UTF-8 codes that are at least 3 bytes. I have a _vague_ recollection that you have to keep escaping and escaping to get up to like 4 bytes for some Asian code points?

Chris
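For reference, the four-byte bit filling for 10FFFF can be checked mechanically. The following is only a sketch in Python, not PostgreSQL code: `utf8_encode` and `as_bits` are hypothetical helper names introduced here for illustration, and the range checks assume the usual UTF-8 limits (code points up to 10FFFF, surrogates excluded).

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a single Unicode code point as UTF-8 (illustrative helper)."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {cp:#x}")
    if cp < 0x80:
        # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


def as_bits(data: bytes) -> str:
    """Render bytes as space-separated 8-bit binary groups."""
    return " ".join(f"{b:08b}" for b in data)


encoded = utf8_encode(0x10FFFF)
print(len(encoded), as_bits(encoded))
# prints: 4 11110100 10001111 10111111 10111111
```

The low six bits of the code point go into the last continuation byte, the next six into the one before it, and so on, which is the "starting from the bottom" order. Python's built-in `chr(0x10FFFF).encode("utf-8")` yields the same four bytes and can serve as a cross-check.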