Dude, you're not getting it. You still don't understand what Unicode or an 
encoding is. Arguments like the following demonstrate this:
>Yet another cause of why one needs the sourcecode of the system 
>encoding the UTF-8: "In Unicode, some characters may be represented in 
>various ways"

That has nothing to do with UTF-8 per se. It has to do with which Unicode 
code points are being represented, which is not strictly the same as the glyphs. 
As others have explained, those different Unicode representations of the same 
character mean there's more than one UTF-8 byte sequence for it. That's where 
normalization comes in.
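
To make that concrete, here's a small Python illustration (my own sketch, not 
anything from your system or from ICU): the two canonically equivalent 
spellings of "é" produce different UTF-8 bytes, so their checksums differ 
until you normalize.

# Two canonically equivalent spellings of "é": different UTF-8 bytes,
# identical bytes after NFC normalization.
import unicodedata
import hashlib

precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # U+0065 + U+0301 COMBINING ACUTE ACCENT

print(precomposed.encode("utf-8"))  # b'\xc3\xa9'
print(decomposed.encode("utf-8"))   # b'e\xcc\x81' (different bytes)

# Checksums of the raw UTF-8 differ...
print(hashlib.sha256(precomposed.encode("utf-8")).hexdigest()
      == hashlib.sha256(decomposed.encode("utf-8")).hexdigest())  # False

# ...but after normalizing both to NFC they match.
nfc1 = unicodedata.normalize("NFC", precomposed)
nfc2 = unicodedata.normalize("NFC", decomposed)
print(hashlib.sha256(nfc1.encode("utf-8")).hexdigest()
      == hashlib.sha256(nfc2.encode("utf-8")).hexdigest())        # True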

We're back to "Source code for WHAT?" A UTF-8 implementation is generic. You 
can find one anywhere, I'm sure, and modulo bugs, they're all the same. The 
spec is simple enough that there shouldn't be any bugs (and it's easy enough, 
if tedious, to test against every code point in Unicode, so there really 
shouldn't be any). So having source code for a UTF-8 encoder doesn't seem 
particularly useful.
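
To show just how small the spec is, here's a rough Python sketch of the 
standard UTF-8 encoding rules (RFC 3629). It's my own illustration, not the 
source of any particular system, and the exhaustive check at the end is 
exactly the kind of "test against all of Unicode" I mean:

# Sketch of the UTF-8 encoding rules (RFC 3629) for one code point.
# Any correct implementation must produce exactly these bytes.
def utf8_encode(cp: int) -> bytes:
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0x7F:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:      # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),  # 4 bytes: 11110xxx + three 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

# Exhaustive check against Python's own encoder (surrogates excluded):
assert all(utf8_encode(cp) == chr(cp).encode("utf-8")
           for cp in range(0x110000) if not 0xD800 <= cp <= 0xDFFF)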

And as others have stated, there are no codepages in Unicode. One way to look 
at it is that it's all one codepage. So "in order to checksum text against the 
specific encoding map (codepage)" also makes no sense.

This might be key to what you're trying to figure out:
> this assumes my runtime matches a generic AOSP android 13 source

Since UTF-8 is rigidly defined, if you're worried about two UTF-8 
implementations differing, don't be. If one is different, it's flat-out wrong.

Again, UTF-8 is not Unicode, any more than ASCII "is" the glyphs for A-Z: 
ASCII, like UTF-8, is one way to *encode* the characters A-Z. What may cause 
some confusion is that there are multiple ASCII-based code pages; but since 
there's only one Unicode, there's no variation between Unicode code points: 
U+xxxx only ever means one thing. At worst you have an 
application/environment/font that cannot render that U+xxxx character. Look at 
some Wikipedia pages about Unicode and you'll find lots of the infamous "empty 
squares" that most browsers/OSes use to show characters that cannot be 
rendered. But they're still those characters, encoded with UTF-8.
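
A quick Python illustration of that point (my example, nothing more): for A-Z 
the ASCII and UTF-8 encodings are byte-for-byte identical, and a code point 
keeps its one meaning whether or not anything can render it.

# ASCII and UTF-8 agree on A-Z: the encoded bytes are identical.
s = "ABCXYZ"
print(s.encode("ascii") == s.encode("utf-8"))  # True

# A code point always means the same character, rendered or not.
import unicodedata
print(unicodedata.name("\u00e9"))  # LATIN SMALL LETTER E WITH ACUTE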

I keep rereading your posts, trying to guess what problem you're really trying 
to solve. These: 
>checksum text against the specific encoding map (codepage)
and
>In order to fully authenticate: the codepage of the character to glyph 
>map must be known
-- are you trying to see if characters are all valid in a given code page? 
Since code pages don't exist in Unicode, that's not necessary/meaningful.

Maybe what you mean is that you're trying to see if a given byte string is 
valid UTF-8? That's a possible thing to want to do--there are plenty of random 
byte sequences that are not valid UTF-8. Could that be it?
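
If that's it, you still don't need anyone's source code; a trivial Python 
sketch (my example, unrelated to your repo) is enough:

# Check whether a byte string is valid UTF-8.
def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(is_valid_utf8(b"caf\xc3\xa9"))  # True  ("café" in UTF-8)
print(is_valid_utf8(b"caf\xe9"))      # False (a lone 0xE9 is not valid UTF-8)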

-----Original Message-----
From: Unicode <[email protected]> On Behalf Of A bughunter via 
Unicode
Sent: Wednesday, November 6, 2024 9:46 PM
To: Otto Stolz <[email protected]>
Cc: [email protected]
Subject: Re: get the sourcecode [of UTF-8]

My reply to Otto is interspersed.
Originating Question

Where to get the sourcecode of the relevant (version of) UTF-8?: in order to 
checksum text against the specific encoding map (codepage).

from [email protected]

On Wednesday, November 6th, 2024 at 17:19, Otto Stolz 
<[email protected]> wrote:

> Hello bughunter,

I did look over the FAQ and did not find an answer to this question. 
Definitions and terms are not a problem, as I laid them out in my second post, 
which you can see in the mailing-list archive here: 
https://corp.unicode.org/pipermail/unicode/2024-November/011102.html
> before wording a question to any discussion group, it is recommended 
> to read (and understand) the pertinent FAQ list; otherwise the ensuing 
> discussion will focus on definitions and terms rather than the problem 
> at hand. You may start reading at https://www.unicode.org/faq/.
It would seem the only problem we have with Definitions & Terms is others' 
inability to read and speak English. I explained further that Definitions & 
Terms are not a problem in my reply to Jim here: 
https://corp.unicode.org/pipermail/unicode/2024-November/011117.html

My question, "Where to get the sourcecode", is absolutely clear, because if I 
install a Debian FOSS-only system producing UTF-8 texts then there must be 
sourcecode. The problem (of checksums), which may not be clear to 'You', is 
only mentioned as a use-case and is not fully on point for this mailing list. 
Focus on this question.
> That said, I'll try to answer your question. As your problem is not 
> quite clear, you'll get basically three answers, and a technical hint 
> pertaining to two of them.
> 
> You have asked:
> 
> > Where to get the sourcecode of the relevant (version of) UTF-8?:
> > in order to checksum text against the specific encoding map (codepage).
> 
More about my use-case: you will find my statement "In order to fully 
authenticate: the codepage of the character to glyph map must be known." in my 
second post here 
https://corp.unicode.org/pipermail/unicode/2024-November/011102.html. The 
use-case is certainly worth mentioning because, "in order to" do that, the 
source must be known. I restated it for Slawomir, to focus on my maxim, in my 
reply to Slawomir here 
https://corp.unicode.org/pipermail/unicode/2024-November/011111.html, saying 
"the only part you need to focus on to answer the originating question is: 
'the character to glyph map must be known.'", because my query to the mailing 
list is a prerequisite for my use-case. I showed the use-case is intended only 
to accentuate the need to answer the question. 


What I do with the sourcecode or how I will proceed to checksum text has no 
bearing on answering: "Where to get the sourcecode".
> My answer depends on the purpose of the checksum.
> 
> UTF-8 is one method (of a handful of standardized methods) to 
> represent Unicode text at the bit level in order to conveniently 
> transfer, or store, it. If the intent of your checksum is merely to 
> protect against transmission errors, or tampering, then you would 
> simply checksum this bit-level representation of the text – no 
> knowledge of Unicode, or UTFs, is required to achieve this goal.
> 
Protecting against corruption and tampering are two fine points for the use of 
checksums. However, for you to say "if my intent is to protect against 
tampering, no knowledge of UTF is required" is not quite true. And I can carry 
on this discussion with you on my GitHub page here: 
https://github.com/freedom-foundation

> A Unicode code point is a number in the range from 0 to 1 114 111; a 
> Unicode text is a sequence of Unicode code points.
> On the bit level, you can represent that sequence in various ways, cf. 
> https://www.unicode.org/faq/utf_bom.html. Hence, if you
> want to compare two Unicode texts that are represented in arbitrary 
> bit-level representations (UTFs), then you would convert those to the 
> same UTF (preferably UTF-32) and checksum those. (UTF-32 stores the 21 
> bits needed to represent a Unicode code point in one 32 bit wide 
> storage location, leaving 11 bits unused.)
 
Yet another cause of why one needs the sourcecode of the system encoding the 
UTF-8: "In Unicode, some characters may be represented in various ways", and 
this is why I did mention the use-case of checksumming text.
> In Unicode, some characters may be represented in various ways; e. g. 
> an “é” can be coded as one single Unicode code point, viz.
> U+00E9, LATIN SMALL LETTER E WITH ACUTE, or, alternatively, as
> a pair of Unicode code points, viz. U+0065 U+0301 LATIN SMALL LETTER E
> + COMBINING ACUTE ACCENT. To cope with ambiguities of this kind,
> Unicode defines those two representations as “canonically equivalent”, 
> i. e., they are to be treated in every respect as equivalent and 
> interchangeable, for details, cf.
> https://www.unicode.org/faq/normalization.html. Hence,
> if you want to check that two Unicode texts are canonically 
> equivalent, you would first convert them to UTF-32, then ‘normalize’ 
> them (i. e. choose consistently the same representation for all 
> instances of canonically equivalent encodings), then checksum the 
> normalized representations.

Yes, I had noticed ICU; however, I need to look at whatever is actually being 
used. ICU is sidelined and does not apply to what the users are actually 
using.
> You were asking for source code, but the better way to do conversion 
> and normalizations is by using an established and well-tested program 
> library, such as ICU, cf. https://icu.unicode.org/#h.i33fakvpjb7o.
 
You seem so eager to tackle the extra problem. Keep in mind the checksum 
problem is "extra-Unicode", meaning it is outside the scope of the Unicode 
Consortium mailing list. My query here is a sideline to my GitHub repo 
Unicode_map, which you may see here: 
https://github.com/freedom-foundation/unicode_map. You may also see my Proof 
of Concept at 
https://github.com/freedom-foundation/unicode_map?tab=readme-ov-file#proof-of-concept 
if you would like to discuss the fine points of checksumming.
> Good luck with your project,
> Otto

