Re: [ProofPower] "Unicode" vs. UTF8

2023-11-20 Thread Rob Arthan
Makarius,

Good to hear from you.


> On 7 Nov 2023, at 10:09, Makarius wrote:
> 
> On 06/11/2023 02:19, Rob Arthan wrote:
>> One bit of burrowing inside instigated by the developers rather than the 
>> users began a few years ago: an experimental port of ProofPower to use 
>> Unicode and UTF-8 for mathematical symbols in place of the ad hoc character 
>> set that we had to invent for ourselves years ago. Roger Jones did some 
>> excellent work towards this end and made a version of the document 
>> processing tools that would compile all the ML code from UTF-8 sources using 
>> the old character set as an intermediary. I ported the user interface 
>> program xpp based on his work, but it has proved problematic to port all 
>> that we did to Linux: we were developing on Mac OS and, sadly, the FreeBSD 
>> wchar_t libraries are much less restrictive than the GNU libraries, and 
>> getting round this seems to need a lot more work on the document processing 
>> utilities (which are written in C).
> 
> I first heard of "Unicode" approx. 1990. Everybody was talking about "wide 
> characters" (16-bit) and made moves to support them (e.g. major Unix vendors 
> and Microsoft). That was the time of the now obsolete UCS-2: "16 bits should 
> be sufficient for everyone, i.e. every imaginable character on the planet".

With ProofPower, we were just a year or two earlier, when "8-bit clean", i.e. 
compatibility with ISO 8859, was the name of the game. None of the ISO 8859 
encodings covered the mathematical symbols that we needed, so we rolled our 
own encoding that did.

> 
> Many years later, everybody started to use UTF-8 and the meaning of "Unicode" 
> changed quite a bit. The "UTF-8 manifesto" http://utf8everywhere.org helps to 
> understand the situation in retrospect. In particular, "wide characters" are 
> now mostly obsolete, although we still see them as building blocks for UTF-16 
> (which is not the same as 16-bit Unicode). Programming languages with wide 
> characters carry unnecessary baggage. E.g. see this side remark (concerning 
> Python) in the text:
> 
> """
> Furthermore, we would like to suggest that counting or otherwise iterating 
> over Unicode code points should not be seen as a particularly important task 
> in text processing scenarios. Many developers mistakenly see code points as a 
> kind of a successor to ASCII characters. This led to software design 
> decisions such as Python’s string O(1) code point access. The truth, however, 
> is that Unicode is inherently more complicated and there is no universal 
> definition of such a thing as a Unicode character. We see no particular reason to 
> favor Unicode code points over Unicode grapheme clusters, code units or 
> perhaps even words in a language for that. On the other hand, seeing UTF-8 
> code units (bytes) as a basic unit of text seems particularly useful for many 
> tasks, such as parsing commonly used textual data formats. This is due to a 
> particular feature of this encoding. Graphemes, code units, code points and 
> other relevant Unicode terms are explained in Section 5. Operations on 
> encoded text strings are discussed in Section 7.
> """
> 

The way this pans out in C (and in particular in the OpenMotif libraries) is 
that you can either work with UTF-8 encodings as strings of bytes or with 
wchar_t encodings, where a wchar_t is a number representing a Unicode code 
point. On Mac OS, this is no problem, because the wchar_t type comprises 
32-bit integers, which is plenty to cover all the Unicode code points. Alas, 
that doesn't work on MS Windows if you need symbols outside the Basic 
Multilingual Plane (i.e., with code points >= 2^16), because there wchar_t is 
only 16 bits wide. I don't agree with the thinking behind the UTF-8 manifesto: 
it seems to be telling us that we shouldn't use a conceptually simple model of 
strings of symbols drawn from a finite alphabet of known size just because 
certain parties underestimated the required size.

The particular programming problem we have in ProofPower with this is that (to 
allow transition from the old world to the new and vice versa) we need to 
convert an 8-bit clean encoding to a UTF-8 encoding and back again. This 
involves a great deal of work, as the C locale mechanisms aren't very helpful 
here.
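
In outline, though, the forward direction needs no locale machinery at all: a 
256-entry table and a hand-rolled UTF-8 encoder suffice. Here is a sketch (the 
slot assignments below are invented for illustration; they are not the actual 
ProofPower character set):

#include <stddef.h>

/* Hypothetical fragment of the table from our private 8-bit codes to
   Unicode code points; entry 0 marks an unmapped code. */
static const unsigned int pp_to_unicode[256] = {
    [0xA0] = 0x2200,  /* FOR ALL */
    [0xA1] = 0x2203,  /* THERE EXISTS */
    [0xA2] = 0x22A2,  /* RIGHT TACK (turnstile) */
};

/* Encode one code point as UTF-8; returns the number of bytes written
   (1 to 4). */
static int put_utf8(unsigned int cp, unsigned char *out)
{
    if (cp < 0x80) { out[0] = (unsigned char) cp; return 1; }
    if (cp < 0x800) {
        out[0] = 0xC0 | (cp >> 6);
        out[1] = 0x80 | (cp & 0x3F);
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }
    out[0] = 0xF0 | (cp >> 18);
    out[1] = 0x80 | ((cp >> 12) & 0x3F);
    out[2] = 0x80 | ((cp >> 6) & 0x3F);
    out[3] = 0x80 | (cp & 0x3F);
    return 4;
}

/* Convert a buffer in the old 8-bit encoding to UTF-8; the caller must
   provide an output buffer of at least 3 * len bytes (the old codes all
   map inside the BMP). Returns the number of bytes written; ASCII
   passes through unchanged. */
static size_t pp_buf_to_utf8(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    size_t i, j = 0;
    for (i = 0; i < len; i++) {
        if (in[i] < 0x80)
            out[j++] = in[i];
        else if (pp_to_unicode[in[i]] != 0)
            j += put_utf8(pp_to_unicode[in[i]], out + j);
        else
            out[j++] = '?';  /* unmapped: substitute or report an error */
    }
    return j;
}

The reverse direction is the same idea run backwards: decode each UTF-8 
sequence to a code point and look it up in an inverse table, rejecting 
anything that has no slot in the old alphabet.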

> (Maybe this helps to prove that the mailing list is alive.)
> 

Let’s hope so!

Best,

Rob

> 
>   Makarius





Re: [ProofPower] "Unicode" vs. UTF8

2023-11-07 Thread Roger Bishop Jones

On 07/11/2023 10:09, Makarius wrote:
I first heard of "Unicode" approx. 1990. Everybody was talking about 
"wide characters" (16-bit) and made moves to support them (e.g. major 
Unix vendors and Microsoft). That was the time of the now obsolete UCS-2: 
"16 bits should be sufficient for everyone, i.e. every imaginable 
character on the planet".
We began working (at ICL) with Cambridge HOL in 1986, using a modified 
version for an important secure hardware verification in 1988-89, before 
starting the project which first developed ProofPower in 1990-93.
The emphasis was on verification of systems specified in Z, so at first 
we did small things to make HOL look more like Z, and one of those was 
simply to use the extended character set.

At that time we were on Sun workstations, and Linux didn't exist.
Sun had a font editor, so I just added extra characters in the empty 
spaces in the existing character sets and edited the font myself; things 
were way simpler in those days.


There are of course things we would have used if they had been on our 
radar. UTF-8 would have been a good idea, and it would perhaps have been 
better to have started out with an open source project, but we were 
scuppered by Cambridge University's commercialisation of the Poly ML 
system, which had been chosen as the best base for ProofPower before 
that happened.


Roger




Re: [ProofPower] "Unicode" vs. UTF8

2023-11-07 Thread Makarius

On 06/11/2023 02:19, Rob Arthan wrote:


One bit of burrowing inside instigated by the developers rather than the users 
began a few years ago: an experimental port of ProofPower to use Unicode and 
UTF-8 for mathematical symbols in place of the ad hoc character set that we had 
to invent for ourselves years ago. Roger Jones did some excellent work towards 
this end and made a version of the document processing tools that would compile 
all the ML code from UTF-8 sources using the old character set as an 
intermediary. I ported the user interface program xpp based on his work, but it 
has proved problematic to port all that we did to Linux: we were developing on 
Mac OS and, sadly, the FreeBSD wchar_t libraries are much less restrictive than 
the GNU libraries, and getting round this seems to need a lot more work on the 
document processing utilities (which are written in C).


I first heard of "Unicode" approx. 1990. Everybody was talking about "wide 
characters" (16-bit) and made moves to support them (e.g. major Unix vendors 
and Microsoft). That was the time of the now obsolete UCS-2: "16 bits should be 
sufficient for everyone, i.e. every imaginable character on the planet".


Many years later, everybody started to use UTF-8 and the meaning of "Unicode" 
changed quite a bit. The "UTF-8 manifesto" http://utf8everywhere.org helps to 
understand the situation in retrospect. In particular, "wide characters" are 
now mostly obsolete, although we still see them as building blocks for UTF-16 
(which is not the same as 16-bit Unicode). Programming languages with wide 
characters carry unnecessary baggage. E.g. see this side remark (concerning 
Python) in the text:


"""
Furthermore, we would like to suggest that counting or otherwise iterating 
over Unicode code points should not be seen as a particularly important task 
in text processing scenarios. Many developers mistakenly see code points as a 
kind of a successor to ASCII characters. This led to software design 
decisions such as Python’s string O(1) code point access. The truth, however, 
is that Unicode is inherently more complicated and there is no universal 
definition of such a thing as a Unicode character. We see no particular reason to 
favor Unicode code points over Unicode grapheme clusters, code units or 
perhaps even words in a language for that. On the other hand, seeing UTF-8 
code units (bytes) as a basic unit of text seems particularly useful for many 
tasks, such as parsing commonly used textual data formats. This is due to a 
particular feature of this encoding. Graphemes, code units, code points and 
other relevant Unicode terms are explained in Section 5. Operations on encoded 
text strings are discussed in Section 7.

"""

(Maybe this helps to prove that the mailing list is alive.)


Makarius

