> I have a form that posts diacritical characters. For example, when my browser
> has the encoding set to utf-8 and the form posts the character É,
> the post data has these two bytes, C3 and 89, which when echoed back on a new
> page are displayed as Ã?.  Can someone explain, when the character is converted
> to two bytes, how I get C3 and 89?
> 

UTF-8 is explained in section 3.9 of the Unicode standard and elsewhere (RFC 2279 is a 
heavily referenced document; note that its description also covers the encoding of 
codepoints outside of the Unicode range).

É is U+00C9 and in binary that is:

00000000 11001001

UTF-8 encoding uses a different number of bytes depending on how many significant bits 
remain once you remove the leading zeros (8 bits in this case, which needs two bytes).

It then packs those bits from the codepoint into bytes like so:


00000000 0xxxxxxx -> 0xxxxxxx
00000yyy yyxxxxxx -> 110yyyyy 10xxxxxx
zzzzyyyy yyxxxxxx -> 1110zzzz 10yyyyyy 10xxxxxx
000uuuuu zzzzyyyy yyxxxxxx -> 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
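
As a concrete illustration of that table (a minimal Python sketch, nothing to do with 
your actual form handler): pick the shortest row whose bit capacity fits the codepoint, 
then distribute the codepoint's bits into the marked positions.

def utf8_encode(codepoint):
    # Up to 7 bits: 0xxxxxxx
    if codepoint <= 0x7F:
        return bytes([codepoint])
    # Up to 11 bits: 110yyyyy 10xxxxxx
    if codepoint <= 0x7FF:
        return bytes([0xC0 | (codepoint >> 6),
                      0x80 | (codepoint & 0x3F)])
    # Up to 16 bits: 1110zzzz 10yyyyyy 10xxxxxx
    if codepoint <= 0xFFFF:
        return bytes([0xE0 | (codepoint >> 12),
                      0x80 | ((codepoint >> 6) & 0x3F),
                      0x80 | (codepoint & 0x3F)])
    # Up to 21 bits: 11110uuu 10uuzzzz 10yyyyyy 10xxxxxx
    return bytes([0xF0 | (codepoint >> 18),
                  0x80 | ((codepoint >> 12) & 0x3F),
                  0x80 | ((codepoint >> 6) & 0x3F),
                  0x80 | (codepoint & 0x3F)])

print(utf8_encode(0x00C9).hex())   # c389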

In the case of U+00C9 the second of these (the two-byte form) is the shortest form 
possible, so it is used. The bits 00011 are placed in 110yyyyy to give you 11000011 
(0xC3), and the bits 001001 are placed in 10xxxxxx to give you 10001001 (0x89).
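
You can confirm that calculation with any UTF-8-aware tool; for example, in Python 
(purely an illustration, not anything from the original post):

data = "\u00C9".encode("utf-8")
print(data.hex())                    # c389

# Mis-reading those same two bytes as a single-byte encoding reproduces the
# symptom described in the question:
print(data.decode("iso-8859-1"))     # 'Ã' followed by the control character 0x89
print(data.decode("windows-1252"))   # 'Ã‰'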

The problem is that this didn't happen when the bytes went back out again - rather the 
bytes were interpreted as being part of a string encoded in some other way (most 
likely ISO 8859-1, which certainly would produce Ã followed by a control character 
from those bytes). It may be that all you need to do is to correctly report the 
encoding, by sending an HTTP header with the MIME type and charset (some server-side 
APIs make this easy, e.g. in ASP you would use Response.Charset = "utf-8"). It may be 
that you need to do further work (depending on just what it is you are doing with the 
form).
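
If you happen to be on something other than ASP - say a Python WSGI application, used 
here purely as an assumed example - the same fix is to declare the charset on the 
Content-Type header:

# Minimal WSGI sketch (an assumed Python setup, not the poster's ASP code).
# The important part is the charset parameter on the Content-Type header.
def app(environ, start_response):
    body = "É".encode("utf-8")                    # the bytes C3 89
    start_response("200 OK", [
        ("Content-Type", "text/html; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

The ASP line above (Response.Charset = "utf-8") tells the browser the same thing: 
decode the response bytes as UTF-8 rather than guessing at ISO 8859-1.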




