On 2024-11-06 18:45, A bughunter via Unicode wrote:
…My query here is a sideline to my GitHub repo Unicode_map you may see here https://github.com/freedom-foundation/unicode_map and you may see my Proof of Concept https://github.com/freedom-foundation/unicode_map?tab=readme-ov-file#proof-of-concept if you would like to discuss the fine points of checksumming

Thank you for posting that link. It gives me a sense of what you want to do.

What I see in the repo are various representations of historical documents from the 18th century, which were originally produced as English-language text handwritten on parchment with pen and ink. You have images of the text on physical pages, and the character content of the texts in UTF-8. You write there,

These documents have priority to be archived in both ASCII wordcounted and checksummed text and PDF/A-1 archive format for long term preservation then signed and attested, garunteed to be veritable to corresponding printed replica documents.… I’m interested in sourcecode and libre-sourcecode. Libre-sourcecode being defined as allof a machine specification (chip design), compiler and application sourcecode which can be written out in respective computer programming languages and archived, saved, transmit, reproduced, and build and run all from paper written legit.…
Source: <https://github.com/freedom-foundation>

Your first two messages said,

Where to get the sourcecode of relevent (version) UTF-8?: in order to checksum text against the specific encoding map (codepage).
Source: <https://corp.unicode.org/pipermail/unicode/2024-November/011099.html>

what implimentation of UTF-8: the answer is the relevent implimentation is android 13 libbionic (bionic C) which uses UTF-8.…
…android 13 is open source AOSP and, it would be possible to point out the exact unicode used in it however this assumes my runtime matches a generic AOSP android 13 source. So then the way in which I framed my question does probe as to if there is any way to display the compile time UTF-8. Sometimes there are --version options.
The part you do not seem to understand is the full circle of authentication of a checksummed text. In order to fully authenticate: the codepage of the character to glyph map must be known. Anything further on this checksumming process would not be directly on topic of this mailing list
Source: <https://corp.unicode.org/pipermail/unicode/2024-November/011102.html>


Put in the conventional terminology of text processing and display in software systems, it seems that you want to preserve historical documents in digital form. This digital form includes an expansive swath of the software stack: not just the document content, but also several layers of software and hardware necessary to present the document. As part of this, you want to calculate some sort of robust digest of the digital form, so that a receiver of the document can assure themselves that what they see (experience) when viewing the digital form of the document bears the same relationship to the original document that it bore for you when you authored the digital form.
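
To make the digest idea concrete: a checksum of this kind is computed over the raw UTF-8 bytes of the text, and over nothing else. Here is a minimal sketch of my own (not the scheme from your repository), in C, using the public-domain FNV-1a hash purely for brevity; a real archival workflow would use a cryptographic digest such as SHA-256:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* FNV-1a 64-bit hash over a byte buffer. Illustrative only: a real
       archive would use a cryptographic digest such as SHA-256. */
    static uint64_t fnv1a_64(const unsigned char *data, size_t len)
    {
        uint64_t hash = 0xcbf29ce484222325ULL;   /* FNV offset basis */
        for (size_t i = 0; i < len; i++) {
            hash ^= data[i];
            hash *= 0x00000100000001b3ULL;       /* FNV prime */
        }
        return hash;
    }

    int main(void)
    {
        /* The digest covers the UTF-8 byte sequence and nothing more:
           it cannot attest to fonts, glyphs, or rendering. */
        const char *text = "We the People of the United States";
        printf("%016llx\n",
               (unsigned long long)fnv1a_64((const unsigned char *)text,
                                            strlen(text)));
        return 0;
    }

Two byte-identical copies of the text produce the same digest no matter what font or rendering stack displays them; conversely, the digest by itself cannot certify anything about the glyphs a reader will see.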

One part of your software stack is similar to, but not necessarily the same as, the Android Open Source Project's libbionic (an implementation of libc).

You are looking for the source code for the part of your library which processes character codes in UTF-8 form, believing that this source code will show you how UTF-8 code units processed by that library end up displayed as "glyphs" <https://unicode.org/glossary/#glyph> on a display surface. You want to capture this relationship between code units and glyphs as part of your robust digest. You expect that the answer will be simple enough that a single email to a mailing list will result in a simple reply which gives you what you seek.

I did a little web searching, and I think I can point you to some places where libbionic <https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc> processes code units in UTF-8 form. The source code uses the tags "mb", short for "multi-byte", and "wc", short for "wide character", in the names of functions which operate on UTF-8 code unit data and Unicode scalar values respectively. Take a look at the two functions below (a short decoding sketch follows them):

function mbsnrtowcs() <https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/wchar.cpp#68>

function mbrtoc32() <https://android.googlesource.com/platform/bionic/+/refs/heads/main/libc/bionic/mbrtoc32.cpp#36>
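
To make concrete what these functions do, here is a minimal sketch of my own (not code from bionic) using the standard C11 <uchar.h> interface that bionic's mbrtoc32.cpp implements. It assumes the current locale uses UTF-8; the locale name "C.UTF-8" is my assumption and varies by platform, though bionic itself only supports UTF-8:

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>
    #include <uchar.h>

    int main(void)
    {
        /* Assumed locale name; bionic only supports UTF-8, but other C
           libraries need an explicit UTF-8 locale selected here. */
        setlocale(LC_ALL, "C.UTF-8");

        const char *s = "\xC3\xA9";  /* the two UTF-8 code units for U+00E9 */
        mbstate_t state;
        memset(&state, 0, sizeof state);

        char32_t c32;
        size_t consumed = mbrtoc32(&c32, s, strlen(s), &state);
        if (consumed != (size_t)-1 && consumed != (size_t)-2) {
            /* Decoding recovers the scalar value U+00E9, but says nothing
               about which glyph a font will draw for it. */
            printf("consumed %zu bytes -> U+%04X\n", consumed, (unsigned)c32);
        }
        return 0;
    }

Note what the conversion gives you: code units in, scalar values out. The character-to-glyph step happens much later, in the font and text-rendering layers, which libc never touches.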

I imagine you will find these unsatisfying. They implement the UTF-8 data conversions with no mention of either UTF-8 version or Unicode version. Nor do they mention glyphs, fonts, character-to-glyph mapping, or any of the other text-rendering complexity which it seems you want to characterise.

I have the impression that you are trying to reinvent a great deal of work in text representation, text display, digital document preservation, archiving, and software preservation, without yet having taken the time to learn about existing work in those fields. If your intent is to preserve 18th-century handwritten documents well, I suggest you start by representing them as well-crafted PDF/A files. You could perhaps get a PhD in digital archiving and still not exhaust all the implications of what I think you are asking.

Good luck with your project! Best regards,
      —Jim DeLaHunt

--
.   --Jim DeLaHunt, [email protected] http://blog.jdlh.com/ (http://jdlh.com/)
      multilingual websites consultant, Vancouver, B.C., Canada
