Welcome to UTF-8.

This is something I consult on all the time. The days when the encoded
length, the character count, and the displayed length were all the same
number are long gone. It's a habit you have to train yourself out of
(and it doesn't help that languages like C and C++ call a byte a
"char").

One character can take anywhere from 1 to 4 bytes in UTF-8.

Basically:
U+000000 to U+00007F (basic Latin) = 1 byte - the graceful part of
UTF-8 is that it is directly equivalent to ASCII in that range.
U+000080 to U+0007FF - 2 bytes
U+000800 to U+00FFFF - 3 bytes
U+010000 to U+10FFFF - 4 bytes
(U+10FFFF is the highest Unicode code point, so 4 bytes is the maximum.)
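
If you want to see the ranges above for yourself, here is a quick Python 3
sketch. The helper name utf8_len is just something I made up for
illustration, and the sample characters are my own picks, one from each
range. It is not how Twitter counts anything, only a way to compare
character count against UTF-8 byte count.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes `text` occupies once encoded as UTF-8."""
    return len(text.encode("utf-8"))


# One sample character from each range in the table (illustrative picks).
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    print(f"U+{ord(ch):06X} -> {utf8_len(ch)} byte(s)")

# A word from the tweet in question: len() counts characters (code points),
# while utf8_len() counts bytes, so the two differ by one for the accented "a".
word = "usu\u00e1rios"
print(len(word), "characters,", utf8_len(word), "bytes")
```

The point is that a length limit applied to bytes will cut off accented
text earlier than the same limit applied to characters.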

See: http://en.wikipedia.org/wiki/UTF-8

Zac Bowling
http://zbowling.com/


On Jan 7, 7:39 pm, benjackson <bhjack...@gmail.com> wrote:
> Just sent out the following tweet through the API:
>
> @gabrielemcrise acho que é um misto de pioneirismo +hype+base de
> usuários. E também o API, que cercou o serviço de ferramentas
> interessantes
>
> The international characters are being counted more than once and the
> tweet shows up as:
>
> @gabrielemcrise acho que é um misto de pioneirismo +hype+base de
> usuários. E também o API, que cercou o serviço de ferramentas inter
