Re: get the sourcecode [of UTF-8]

Giacomo Catenazzi via Unicode Mon, 04 Nov 2024 01:33:23 -0800

On 2024-11-03 23:42, Jim DeLaHunt via Unicode wrote:

Hello, anonymous person:
On 2024-11-02 17:42, A bughunter via Unicode wrote:
Where to get the sourcecode of relevent (version) UTF-8?: in order to checksum text against the specific encoding map (codepage).
I'm afraid I don't really understand what you are asking here.
UTF-8 is a data format, a way of representing 21-bit Unicode scalar integers in 1, 2, 3, or 4 bytes (octets). It is defined in section 2.5.3, "UTF-8" <https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G11165>, of the Core Specification of the Unicode Standard. It has not changed over time, so it doesn't really have versions.
If by "source code" you refer to an implementation of the UTF-8 format, then there is no single answer. There are multiple implementations of UTF-8, and so multiple independent bodies of "source code".
And there are many things which could be called a "specific encoding map (codepage)". I don't know which of those you are referring to.

Checksumming may be tricky (if I am interpreting the question correctly). The most obvious problem is the line ending: a new line may be encoded as CR, CR+LF, or LF. Programs may translate between these, so to checksum text you may need to normalize line endings first.
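
As an illustration of the line-ending point, here is a minimal Python sketch. The function names and the choice of LF and SHA-256 are my own assumptions, not something the original question specified:

```python
import hashlib

def normalize_newlines(data: bytes) -> bytes:
    # Replace CR+LF first, so that it does not become two LFs,
    # then replace any remaining lone CR.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")

def checksum_text(data: bytes) -> str:
    # Hypothetical helper: hash the bytes after line-ending normalization.
    return hashlib.sha256(normalize_newlines(data)).hexdigest()

# The same two-line text with three different line-ending conventions.
unix = b"first\nsecond\n"
dos = b"first\r\nsecond\r\n"
old_mac = b"first\rsecond\r"

print(checksum_text(unix) == checksum_text(dos) == checksum_text(old_mac))
```

Without the normalization step the three inputs would hash differently, even though most people would call them the same text.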

But then we have additional *problems* with Unicode: there may be more than one way to encode the same character. For example, accented characters may be encoded as one character, or as two: a base character and a combining diacritic (accent). Apple prefers the latter and Microsoft the former. So it depends on your encoding map preference (and possibly further normalization). We may argue that the shorter form should be better in this case: one of the tasks of Unicode was to map each character of commonly used (and also less used) encodings to a single Unicode character, which hints at a preference for that mapping. So for a checksum you may need to agree on a normalized form, and that unfortunately may depend on the Unicode version (or better: text written with new Unicode characters may not be correctly normalized by older programs).
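
A small Python sketch of the same issue, assuming NFC (the composed form) as the agreed normalization and SHA-256 as the checksum; both choices, and the function name, are only for illustration:

```python
import hashlib
import unicodedata

def checksum_nfc(text: str) -> str:
    # Hypothetical helper: normalize to NFC, encode as UTF-8, then hash.
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE, one code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# The two strings look the same but are different code point sequences,
# and therefore different UTF-8 byte sequences.
print(precomposed == decomposed)
print(precomposed.encode("utf-8"), decomposed.encode("utf-8"))

# After normalization both give the same checksum.
print(checksum_nfc(precomposed) == checksum_nfc(decomposed))

# The Unicode data version used by this Python build; characters newer
# than this may not be normalized as a newer program would do it.
print(unicodedata.unidata_version)
```

Both sides of a checksum comparison would have to agree on the normalization form, and ideally on the Unicode version too.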

Note: overlong UTF-8 encodings (encoding a Unicode character without using the minimal-length UTF-8 sequence) are not considered valid. That should be caught earlier, but with a checksum care should be taken, or else this special case (like many others) may be abused, which is often a grave security issue. So it is complex, and your question is too vague (and imprecise) for us to help.
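
To show what an overlong encoding looks like, here is a short Python sketch. It relies on Python's strict UTF-8 decoder to do the validation; the helper name is my own:

```python
def is_valid_utf8(data: bytes) -> bool:
    # Hypothetical helper: strict decoding raises UnicodeDecodeError on
    # ill-formed input, which includes overlong sequences.
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True

slash = b"/"                   # U+002F in its only valid form (1 byte)
overlong_2 = b"\xc0\xaf"       # the same code point padded to 2 bytes
overlong_3 = b"\xe0\x80\xaf"   # the same code point padded to 3 bytes

for candidate in (slash, overlong_2, overlong_3):
    print(candidate, is_valid_utf8(candidate))
```

A decoder that accepted the padded forms would see three different byte strings (and three different checksums) for one "/" character, which is how such input can slip past a check done on the raw bytes.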

I recommend that you look at existing implementations: the PGP (and GPG) protocol may give some hints on securely checksumming text.


giacomo
