Hi Branden,

On 27/04/2023 05:07, G. Branden Robinson wrote:
> At 2023-04-26T19:33:48+0200, Oliver Corff wrote:
>> I am not familiar with modern incarnations of C/C++. Is there really
>> no char data type that is Unicode-compliant?
>
> There is.  But "Unicode" is a _family_ of standards.  There are multiple
> ways to encode Unicode characters, and those ways are good for different
> things.

I was intentionally vague.

> Along came Unix creator, Ken Thompson, over 20 years after his first
> major contribution to software engineering.  Thompson was a man whose
> brain took to Huffman coding like a duck to water.  Thus was born UTF-8
> (which isn't precisely a Huffman code but has properties reminiscent of
> one), where your ASCII code points would be expressible in one byte, and
> then the higher the code point you needed to encode, the longer the
> multi-byte sequence you required.  Since the Unicode Consortium had
> allocated commonly used symbols and alphabetic scripts toward the lower
> code points in the first place, this meant that even where you needed
> more than one byte to encode a code point, with UTF-8 you might not need
> more than two.  And as a matter of fact, under UTF-8, every character in
> every code block up to and including NKo is expressible using up to two
> bytes.[2]

I like the Huffman code analogy! The situation is not as clear-cut for
CJK texts: there are massive frequency peaks at a few dozen or a few
hundred characters (in both Chinese and Japanese), but because of how
the characters are arranged in the code charts, these biases are not
visible from the character tables; the distribution is more even, not
leaning so much to the left. In general, CJK characters in the BMP
(Basic Multilingual Plane) need three octets, which is a 50% penalty
over traditional CJK encodings (where the user was limited to *either*
simplified Chinese, *or* traditional Chinese, *or* Japanese, but not a
mixture of everything). On today's systems this does not really slow
down work, and if the text file for a whole book grows from 700 kB to a
little over 1 MB, it doesn't really change anything from a user's
perspective.
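
To make the octet counts concrete, here is a minimal Perl sketch (the
sample code points are my own choice, one per UTF-8 length class):

    use strict;
    use warnings;
    use Encode qw(encode);

    # One sample code point per UTF-8 length class:
    #   U+0041  ASCII letter A                  -> 1 octet
    #   U+00E9  Latin small e with acute        -> 2 octets
    #   U+07FF  last code point with two octets -> 2 octets
    #   U+4E2D  a BMP CJK ideograph             -> 3 octets
    #   U+20000 CJK Extension B, beyond the BMP -> 4 octets
    for my $cp (0x41, 0xE9, 0x7FF, 0x4E2D, 0x20000) {
        my $octets = encode('UTF-8', chr($cp));
        printf "U+%04X needs %d octet(s) in UTF-8\n", $cp, length($octets);
    }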
"Unicode-compliant" is not a precise enough term to mean much of
anything.  Unicode is a family of standards; the core of which most
people of heard, and a pile of "standard annexes" which can be really
important in various domains of practical engineering around this
encoding system.[3]

Regards,
Branden

For the matter of this exchange: I never really left the letter P when
learning programming languages. I started with Pascal when Borland
Pascal on DOS machines was all the rage, and from there jumped directly
to Perl, due to its wonderfully elliptical style, the moment I
familiarized myself with X11 workstations at our university (I am a
linguist by training, and many of the Perl language constructs came
alive in my brain the very instant I used them for the first time).
Later I started learning Prolog, but never made it to Py (and anything
that follows).

For all (read: my) practical purposes, Perl reads and stores all
characters as UTF-8 (to be honest, I am not at all aware of the *exact*
internal data storage model of Perl), and I can process strings
containing a wild mix of characters (CJK, Cyrillic, and other character
sets) without ever running into problems. I never have to deal with the
individual bytes within characters or keep track of encoding state as
long as the file handles are declared as :utf8. Even data files of
dozens of MB, containing tens or hundreds of thousands of text lines,
are processed without any noticeable penalty or trade-off in time.
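
By way of illustration, this is the kind of minimal loop I have in mind
(the file name is a placeholder; any UTF-8 text with a mix of scripts
will do):

    use strict;
    use warnings;

    # "mixed.txt" is a placeholder for any UTF-8 text file.
    # :encoding(UTF-8) is the stricter spelling of the :utf8 layer.
    open my $in, '<:encoding(UTF-8)', 'mixed.txt'
        or die "cannot open mixed.txt: $!";

    my ($lines, $chars) = (0, 0);
    while (my $line = <$in>) {
        $lines++;
        $chars += length $line;   # counts characters, not octets
    }
    close $in;

    printf "%d lines, %d characters\n", $lines, $chars;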

Perl is written in C (as far as I know), so it probably uses the C
libraries.

Would it be a feasible option to use UTF-8 throughout the inner workings
of a future groff, and to translate UTF-8 to UTF-16 if and only if there
is an absolute need to do so? You mentioned the PDF bookmarks as one
critical case.
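
In Perl terms (purely as an illustration of the principle, not a claim
about groff's internals), that boundary translation is a one-liner with
the core Encode module; as far as I understand, PDF expects text strings
such as bookmarks as UTF-16BE with a leading byte order mark:

    use strict;
    use warnings;
    use Encode qw(decode encode);

    # Pretend these octets arrived from a UTF-8 document source:
    # "Vorwort " followed by U+524D U+8A00 ("preface" in CJK).
    my $octets = "Vorwort \xE5\x89\x8D\xE8\xA8\x80";
    my $title  = decode('UTF-8', $octets);   # internal character string

    # Only at the PDF boundary: serialize as UTF-16BE with a byte order
    # mark, the form PDF expects for bookmark strings.
    my $pdf_text = "\xFE\xFF" . encode('UTF-16BE', $title);
    printf "bookmark payload: %d octets\n", length $pdf_text;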

Best regards,

Oliver.

--
Dr. Oliver Corff
Mail: oliver.co...@email.de

