On 11/4/2024 2:43 PM, Phil Smith III via Unicode wrote:
I've been watching this from afar and am equally confused. It sounds like OP is
confusing encoding (the various UTF flavors) with Unicode. I understand
that--it's a surprisingly confusing thing; weirdly, now that I DO understand
it, I can't quite see why it's so hard to grok, but I sure do remember that it
took me and a lot of other, smarter people I know *many* tries to really get it.
The various UTFen are just ways to encode, as others have noted. That means all
they tell you is that if you're trying to encode--indicate, store, specify--a
specific Unicode thing (codepoint/character/glyph*) using (say) UTF-8, here are
the rules for how to do so. There's nothing in a UTF that requires that to even
be a defined thing. If there's an undefined thing in the Unicode spec, you can
still encode that thing with UTF-8. The encoding says nothing about whether the
thing will render reasonably or not on a given platform.
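To make that concrete: a UTF-8 encoder neither knows nor cares whether a code point is assigned. A minimal Python sketch (U+0378 happens to be unassigned at the time of writing):

```python
# U+0378 is an unassigned code point, yet UTF-8 encodes and decodes
# it without complaint: the encoding is defined over the whole code
# space, not just over assigned characters.
unassigned = "\u0378"
encoded = unassigned.encode("utf-8")
print(encoded.hex())                          # cdb8
assert encoded.decode("utf-8") == unassigned  # round-trips losslessly
```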
Would it help to reference UTR#17, "Unicode Character Encoding Model"?
A./
Perhaps if OP were to explain what the problem is they're trying to solve, we
could figure out what the real question is?
If it's this:
I need the bytecode to glyph map of UTF-8 as it is used by my runtime software.
...then I don't think UTF-8 is relevant: what you'd need is to know what
Unicode version your software conforms to. That tells you the mapping, if the
data is indeed encoded as UTF-8.
At the risk of being insulting--not my goal!--consider good ol' ASCII and EBCDIC. An
uppercase letter A in ASCII is a single byte (as is all ASCII), 0x41. In EBCDIC, the same
letter is 0xC1 (or X'C1', as an EBCDIC person is more likely to write it). Both of those
are talking about the same thing, but they're different encoding systems. The same
character in Unicode encoded as UTF-8, because the first 128 code points (U+0000
through U+007F) "just happen to be" the same as ASCII, would also be 0x41.
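For the curious, the same comparison in a few lines of Python (cp500 is one common EBCDIC code page; this is an illustrative sketch, not a catalog of EBCDIC variants):

```python
# The same abstract character 'A' (U+0041) under three encodings.
print("A".encode("ascii").hex())   # 41
print("A".encode("cp500").hex())   # c1  (EBCDIC, code page 500)
print("A".encode("utf-8").hex())   # 41  (same as ASCII below U+0080)
```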
That's the definition of Unicode (which expands but never changes incompatibly, even when mistakes
are recognized) and how UTF-8 encodes it. But since that encoding is fully defined and stable,
there's no "source code" needed, nor is there any "bytecode to glyph map of UTF-8 as
it is used by my runtime software": there's just Unicode, and where a given thing is defined
in THAT map. The UTF-8 part is then deterministic, no matter the platform.
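That determinism is easy to demonstrate: the byte pattern falls out of the code point's numeric value alone (per RFC 3629), so a hand-rolled encoder must agree with any library's. A small Python sketch for the two-byte case:

```python
# UTF-8 is fully determined by the code point's value, so any
# correct implementation produces identical bytes. Hand-rolled
# two-byte encoder vs. the standard library:
def utf8_two_byte(cp: int) -> bytes:
    assert 0x80 <= cp <= 0x7FF          # two-byte range only
    return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])

cp = 0x0151                              # U+0151, 'ő'
assert utf8_two_byte(cp) == chr(cp).encode("utf-8")
```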
Does this help? I'm now quite interested in what the real problem is!
...phsiii
*Yes, those are all slightly different in various contexts, but I'm using them
all together to mean what we all know we're talking about: a single Unicode
thing.
-----Original Message-----
From: Unicode <[email protected]> On Behalf Of Slawomir Osipiuk
via Unicode
Sent: Monday, November 4, 2024 4:06 PM
To: A bughunter <[email protected]>; A bughunter via Unicode
<[email protected]>
Subject: Re: get the sourcecode [of UTF-8]
On Monday, 04 November 2024, 00:43:29 (-05:00), A bughunter via Unicode wrote:
No, it does not answer my question.
I don't think I'm alone in saying that your question is very unclear, in large part because of your very
strange use of certain terms. I don't think I've ever encountered "bytecode" outside of
Java implementations, and it never refers to textual (prose) data the way you seem to use it. I still
don't know what "compile time UTF-8" is supposed to be, and I've read both your messages
multiple times.
In order to fully authenticate: the codepage of the character to glyph map must
be known.
To authenticate what? At the end of the day, you're always just authenticating
a stream of bits.
I need the bytecode to glyph map of UTF-8 as it is used by my runtime software.
You want to map UTF-8-encoded code points to characters? (Glyphs are the visual
representations of characters, determined by the font.) In that case the "map"
is the Unicode Character Database. Each code point (encoded as one or more bytes in UTF-8) maps to
a character. Versions of the database are freely accessible online.
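In Python, for instance, the decode step and the database lookup are visibly separate, which illustrates the division of labor (a sketch using the stdlib unicodedata module, which ships a snapshot of the UCD):

```python
import unicodedata

data = b"\xc5\x9a"            # UTF-8 bytes for a single code point
ch = data.decode("utf-8")     # UTF-8 layer: bytes -> code point
print(f"U+{ord(ch):04X}")     # U+015A
print(unicodedata.name(ch))   # UCD layer: LATIN CAPITAL LETTER S WITH ACUTE
```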
But I am still very unsure of what you're asking for. Are you concerned that code points may be
reassigned in the future? That, for example, writing "Smith" in version 16 may appear as
"Smite" in a future version, and this affects the apparent content of a checksummed text
file? If so, that is prevented by the Unicode Stability Policy; assigned code points cannot have
their character identity changed. I don't see any practical way of exploiting differences between
Unicode versions to alter the apparent content of text.
If you wish to checksum a text file encoded in UTF-8, any implementation of a
well-defined checksum algorithm will work. Your runtime doesn't matter. The
checksum will be on the bytes of the file. If you must know what version of the
Unicode Standard was used when creating the file -- and that's a strange thing
to want -- that would have to be included in the file prior to checksumming it.
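Put another way: any checksum is computed over bytes, and identical bytes give identical digests everywhere. A minimal Python sketch (the literal string here stands in for a file's contents):

```python
import hashlib

# Checksums see only bytes; the runtime's Unicode version never enters.
text = "Smith\n"                    # stand-in for a file's contents
data = text.encode("utf-8")         # a fixed byte sequence
digest = hashlib.sha256(data).hexdigest()
# Any conforming SHA-256 implementation, on any platform, yields the
# same 64-hex-digit digest for these same bytes.
print(len(digest))                  # 64
```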
That said, I remain confused how the "source code" of anything is supposed to
help you.
Sławomir