On 2024-11-03 23:42, Jim DeLaHunt via Unicode wrote:
Hello, anonymous person:

On 2024-11-02 17:42, A bughunter via Unicode wrote:

Where to get the source code of relevant (version) UTF-8?: in order to checksum text against the specific encoding map (codepage).

from [email protected]

I'm afraid I don't really understand what you are asking here.

UTF-8 is a data format, a way of representing Unicode scalar values (which fit in 21 bits) in 1, 2, 3, or 4 bytes (octets). It is defined in section 2.5.3, "UTF-8" <https://www.unicode.org/versions/Unicode16.0.0/core-spec/chapter-2/#G11165>, of the Core Specification of the Unicode Standard. It has not changed over time, so it doesn't really have versions.
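To make the "1, 2, 3, or 4 bytes" point concrete, here is a small Python sketch. The sample characters and the helper name utf8_length are only illustrative choices, not anything taken from the standard.

```python
# Illustrative sketch: how many bytes UTF-8 uses for a few sample characters.
# The sample list and the helper name are hypothetical, chosen for this example.
samples = ["A", "\u00e9", "\u20ac", "\U0001F600"]  # U+0041, U+00E9, U+20AC, U+1F600


def utf8_length(ch: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of one character."""
    return len(ch.encode("utf-8"))


for ch in samples:
    encoded = ch.encode("utf-8")
    # Print the code point, the byte count, and the bytes in hexadecimal.
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} byte(s): {encoded.hex(' ')}")
```

The mapping from a code point to its bytes is fixed by the format, so any conforming implementation should produce the same bytes for the same code point.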

If by "source code" you refer to an implementation of the UTF-8 format, then there is no single answer. There are multiple implementations of UTF-8, and so multiple independent bodies of "source code".

And there are many things which could be called a "specific encoding map (codepage)". I don't know which of those you are referring to.

Checksumming may be tricky (if I am interpreting the question correctly). The most obvious problem is line endings: depending on the platform, a newline is encoded as CR, CR+LF, or LF. Programs may translate between them, so to checksum text you may need to normalize line endings first.
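As a rough sketch of that newline normalization, assuming SHA-256 as the checksum and LF as the agreed line ending (both are my assumptions, and the function names are hypothetical):

```python
import hashlib


def normalize_newlines(text: str) -> str:
    """Convert CR+LF and lone CR line endings to LF."""
    # Replace CR+LF first so that it does not turn into two LFs.
    return text.replace("\r\n", "\n").replace("\r", "\n")


def text_checksum(text: str) -> str:
    """SHA-256 over the UTF-8 bytes of the newline-normalized text."""
    normalized = normalize_newlines(text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


unix_text = "line one\nline two\n"
dos_text = "line one\r\nline two\r\n"

# Same checksum once line endings are normalized.
print(text_checksum(unix_text) == text_checksum(dos_text))
```

Without the normalization step, the two variants hash differently even though they look identical in most editors.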

But then we have the additional *problems* of Unicode itself: there may be more than one way to encode the same character. For example, an accented character may be encoded as one code point, or as two: a base character followed by a combining diacritic (accent). Apple prefers the latter, and Microsoft the former. So it depends on your encoding map preference (and possibly on further normalization). One may argue that the shorter form is better in this case: one of the tasks of Unicode was to map each character of commonly used (and also less used) encodings to a single Unicode character, which hints at a preference for that mapping. So for a checksum you may need to agree on a normalization form, and that unfortunately may depend on the Unicode version (or better: text written with new Unicode characters may not be correctly normalized by older programs).
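A minimal Python sketch of this, assuming both sides agree on NFC as the normalization form and SHA-256 as the checksum (both are my choices for illustration; the function name nfc_checksum is hypothetical):

```python
import hashlib
import unicodedata

composed = "\u00e9"     # "é" as a single code point
decomposed = "e\u0301"  # "e" followed by COMBINING ACUTE ACCENT


def nfc_checksum(text: str) -> str:
    """SHA-256 over the UTF-8 bytes of the NFC-normalized text."""
    normalized = unicodedata.normalize("NFC", text)
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The raw strings differ, so checksums of the raw bytes would differ too.
print(composed == decomposed)
# After normalization, both forms give the same checksum.
print(nfc_checksum(composed) == nfc_checksum(decomposed))
# The Unicode version of the normalization data this Python ships with.
print(unicodedata.unidata_version)
```

The last line is where the version dependence shows up: a program built against older Unicode data may not know how to normalize characters added later.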

Note: overlong UTF-8 encodings (encoding a Unicode character without using the minimal-length UTF-8 sequence) are not considered valid. That should be caught earlier, but with a checksum you need to take care, or else this special case (like many others) may be abused, which is often a grave security issue. So it is complex, and your question is too vague (and imprecise) for us to help.
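To illustrate the overlong case, here is a small sketch that relies on Python's built-in strict UTF-8 decoder to reject such input before any checksum is computed. The helper name is_valid_utf8 is hypothetical; the bytes C0 AF are a well-known overlong encoding of "/" (U+002F).

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")  # error handling defaults to "strict"
        return True
    except UnicodeDecodeError:
        return False


minimal_slash = b"\x2f"       # "/" in its only valid UTF-8 form
overlong_slash = b"\xc0\xaf"  # the same code point padded into 2 bytes

print(is_valid_utf8(minimal_slash))
print(is_valid_utf8(overlong_slash))
```

Validating first means that two different byte sequences can never be accepted as the same text, so a checksum over the bytes stays meaningful.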

I recommend you look at existing implementations: the PGP (and GPG) protocol may give some hints on how to checksum text securely.

giacomo

