Aw: Re: get the sourcecode [of UTF-8]

Marius Spix via Unicode Fri, 08 Nov 2024 09:23:36 -0800

Dear A bughunter,

I will try to help you. First, please note that this is a public mailing list, so I won't encrypt my answer using PGP.

As I understand your question, you want to convert UTF-8 encoded text into its character codepoints in order to generate a checksum.

A Unicode codepoint is a 21-bit unsigned integer ranging from 0x0 to 0x10FFFF.

There are different encodings to represent a Unicode character. The simplest would be UTF-32, which uses 32 bits for each codepoint.

The most common characters (Latin letters and basic punctuation) are found at the codepoints 0x00 to 0x7F, which require only 7 bits. A document of such characters stored in UTF-32 would consist mostly of zero bytes, which is inefficient and wastes memory. Therefore, UTF-8 uses a trick: multi-byte sequences.
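To make the size difference concrete, here is a small sketch in Python (my choice of language for illustration; the sample string is arbitrary) that encodes the same ASCII-only text in both encodings:

```python
text = "Hello, world"  # arbitrary ASCII-only sample, 12 characters

utf32 = text.encode("utf-32-be")  # big-endian, without a byte order mark
utf8 = text.encode("utf-8")

print(len(utf32))          # 48: four bytes per codepoint
print(len(utf8))           # 12: one byte per codepoint for ASCII-only text
print(utf32[:8].hex(" "))  # 00 00 00 48 00 00 00 65
```

Three out of every four bytes in the UTF-32 version are zero for this kind of text.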

Each byte contains 8 bits.

If the most significant bit is 1 (that is, the byte value is > 0x7F), the byte is either the start of a multi-byte sequence or a continuation of one. A continuation byte always starts with 0b10, that is, its value is between 0x80 and 0xBF.
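This rule can be sketched as a small Python function. The name classify_byte is just something I made up for illustration, and the function only looks at the leading bits; it does not check whether a start byte is actually permitted in UTF-8.

```python
def classify_byte(b: int) -> str:
    """Classify one byte value (0..255) by its leading bits only."""
    if (b & 0x80) == 0:
        return "single"        # 0b0xxxxxxx: a complete one-byte character
    if (b & 0xC0) == 0x80:
        return "continuation"  # 0b10xxxxxx: values 0x80..0xBF
    return "start"             # 0b11xxxxxx: begins a multi-byte sequence


print([classify_byte(b) for b in (0x41, 0xF0, 0x9F, 0x98, 0xB8)])
# ['single', 'start', 'continuation', 'continuation', 'continuation']
```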

Characters in the range 0x00 to 0x7F are coded as they are. That means: codepoint 0b0xxxxxxx becomes 0b0xxxxxxx

Characters in the range 0x0080 to 0x07FF are coded starting with 0b110. That means, codepoint 0b00000yyy_xxxxxxxx becomes 0b110yyyxx_10xxxxxx

Characters in the range 0x0800 to 0xFFFF are coded starting with 0b1110. That means, codepoint 0byyyyyyyy_xxxxxxxx becomes 0b1110yyyy_10yyyyxx_10xxxxxx

Characters in the range 0x010000 to 0x10FFFF are coded starting with 0b11110. That means, codepoint 0b000zzzzz_yyyyyyyy_xxxxxxxx becomes 0b11110zzz_10zzyyyy_10yyyyxx_10xxxxxx
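The four bit patterns above can be written down as a short encoder. This is only a sketch in Python; the function name encode_codepoint is hypothetical, and I added a check that rejects the surrogate range 0xD800 to 0xDFFF, which is not mentioned above but cannot be encoded in UTF-8.

```python
def encode_codepoint(cp: int) -> bytes:
    """Encode one Unicode codepoint as UTF-8 using the bit patterns above."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not an encodable codepoint: {cp:#x}")
    if cp <= 0x7F:
        return bytes([cp])                      # 0b0xxxxxxx
    if cp <= 0x7FF:
        return bytes([0xC0 | (cp >> 6),         # 0b110.....
                      0x80 | (cp & 0x3F)])      # 0b10......
    if cp <= 0xFFFF:
        return bytes([0xE0 | (cp >> 12),        # 0b1110....
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),            # 0b11110...
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Compare against Python's built-in encoder for one sample per length.
for cp in (0x41, 0xE9, 0x20AC, 0x1F638):
    assert encode_codepoint(cp) == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> {encode_codepoint(cp).hex(' ')}")
```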

Now, for example, you encounter the following byte sequence and want to convert it from UTF-8 to the corresponding Unicode code point:

0xF0 0x9F 0x98 0xB8

= 0b11110000 0b10011111 0b10011000 0b10111000

As you see, the sequence starts with 0b11110, which means you have to parse four bytes. The next three bytes each start with 0b10 (the continuation marker), which means the sequence has the expected form.

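Let us transform this using the mapping from 0b11110zzz_10zzyyyy_10yyyyxx_10xxxxxx to 0b000zzzzz_yyyyyyyy_xxxxxxxx:

0b11110000 gives zzz = 000
0b10011111 gives zz = 01 and yyyy = 1111
0b10011000 gives yyyy = 0110 and xx = 00
0b10111000 gives xxxxxx = 111000

This leaves us with 0b00000001_11110110_00111000 = 0x0001F638 = U+1F638 = Grinning Cat Face with Smiling Eyes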
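The same transformation can be done with masks and shifts. The following Python sketch handles only this one four-byte case and is not a general decoder; the variable names are mine.

```python
import unicodedata

data = bytes([0xF0, 0x9F, 0x98, 0xB8])
b0, b1, b2, b3 = data  # unpacking bytes yields integers

# Check the form: one 0b11110... start byte, then three 0b10... bytes.
assert (b0 & 0xF8) == 0xF0
assert all((b & 0xC0) == 0x80 for b in (b1, b2, b3))

# Strip the marker bits and concatenate the payload bits (3 + 6 + 6 + 6).
cp = ((b0 & 0x07) << 18) | ((b1 & 0x3F) << 12) | ((b2 & 0x3F) << 6) | (b3 & 0x3F)

print(f"U+{cp:04X}")                 # U+1F638
print(unicodedata.name(chr(cp)))     # character name from Python's Unicode database
```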

There are several libraries which can be used to parse UTF-8 encoded text and split it into the corresponding codepoints. For example, you can use the Java class java.io.InputStreamReader, with the second argument (the charset name) being the String literal "UTF-8".
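For comparison, here is roughly how the library route looks in Python, where the built-in bytes.decode does the parsing. The input bytes are an arbitrary sample, and the checksum step is only an assumption about what you might want: CRC-32 over each codepoint written as four big-endian bytes is one possible choice among many.

```python
import zlib

data = b"Hi \xf0\x9f\x98\xb8"    # arbitrary sample: "Hi " followed by U+1F638

text = data.decode("utf-8")      # raises UnicodeDecodeError on invalid input
codepoints = [ord(ch) for ch in text]
print([f"U+{cp:04X}" for cp in codepoints])
# ['U+0048', 'U+0069', 'U+0020', 'U+1F638']

# One possible checksum over the codepoints (an illustrative choice only).
checksum = zlib.crc32(b"".join(cp.to_bytes(4, "big") for cp in codepoints))
print(f"{checksum:08x}")
```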

I hope that helps you.

Best regards,

Marius

Sent: Friday, 08 November 2024 at 01:36
From: "Markus Scherer via Unicode" <[email protected]>
To: "Jim Breen" <[email protected]>
Cc: "[email protected]" <[email protected]>
Subject: Re: get the sourcecode [of UTF-8]

On Thu, Nov 7, 2024 at 3:03 PM Jim Breen via Unicode <[email protected]> wrote:

On rare occasions, I need to dig into UTF-8 at the bit level. I have a
note pinned near my desk as an aide memoire. It has 3 lines:

UTF-8
zzzzyyyyyyxxxxxx
1110zzzz 10yyyyyy 10xxxxxx

11110nnn 10zzzzzz 10yyyyyy 10xxxxxx

markus
