Re: [fpc-pascal] Yet another thread on Unicode Strings

2017-10-04 Thread Marco van de Voort
In our previous episode, Tony Whyman said:
> Unicode Character String handling is a question that keeps coming up on 
> the Free Pascal Mailing lists and, empirically, it is hard to avoid the 
> conclusion that there is something wrong with the way these character 
> string types are handled. Otherwise, why does this issue keep arising?

Because people have old code that is ASCII-only, or that handles Unicode in a
different, ad-hoc manner. Moreover, FPC/Lazarus is still usable in an
ASCII-only mode for old projects.

> The programmer is too often forced to be aware of how strings 
> are encoded and must make a choice as to which is the preferred 
> character encoding for their program. There then follows confusion over 
> how to make that choice.

To avoid confusion, make sure it is Unicode. It doesn't matter that
much whether it is UTF-16 or not.

> Is Delphi compatibility the goal? What 
> Languages must I support? If I want platform independence which is the 
> best encoding? Which encoding gives the best performance for my 
> algorithm? And so on.
 
> Another problem is that there is no character type for a Unicode 
> Character. The built-in type “WideChar” is only two bytes and cannot 
> hold a UTF-16 code point encoded as a surrogate pair. There is no 
> char type for a UTF-8 character and, while UCS4Char exists, the Lazarus 
> UTF-8 utilities use “cardinal” as the type for a code point (not exactly 
> strong typing).

Most code will simply use "string" to hold a character. Only special-purpose
code, or code that really must be performant, will do otherwise.
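
As a rough illustration of why a string is the practical container for a
single character, here is a minimal sketch. It is written in Python only
because its standard library makes the encodings easy to inspect; nothing
in it is FPC or Lazarus API, and the helper name utf16_units is made up
for this example.

```python
def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    raw = text.encode("utf-16-le")  # little-endian, no byte order mark
    return [int.from_bytes(raw[i:i + 2], "little") for i in range(0, len(raw), 2)]


bmp_char = "\u00e9"          # e with acute accent, inside the Basic Multilingual Plane
astral_char = "\U0001F600"   # an emoji, outside the Basic Multilingual Plane

print(utf16_units(bmp_char))                        # [233]: one 16-bit unit
print([hex(u) for u in utf16_units(astral_char)])   # ['0xd83d', '0xde00']: one surrogate pair
print(len(astral_char.encode("utf-8")))             # 4 bytes in UTF-8
```

The second value needs two 16-bit units, so it cannot be stored in a
single two-byte character variable, while a string of either encoding
holds it without special handling.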
 
> In order to stop all this confusion I believe that there has to be a 
> return to Pascal's original fundamental concept. That is, the value of a 
> character type represents a character, while the encoding of the 
> character is platform dependent and a choice the compiler makes and not 
> the programmer. Likewise a character string is an array of characters 
> that can be indexed by character (not byte) number, from which 
> substrings can be selected and compared with other strings according to 
> the locale and the unicode standard collating sequence. Let the 
> programmer worry about the algorithm and the compiler worry about the 
> best implementation.
>
> I want to propose a new character type called “UniChar” - short for 
> Unicode Character, along with a new string type “UniString” and a new 
> collection “TUniStrings”. I have presented my thoughts here in a 
> detailed paper
>
This doesn't work, and it seems you haven't read the backlog of
Unicode-related messages going all the way back to early 2009. What you
suggest was one of the null hypotheses back then, and we are now 8 years
further on.

Search for the Unicode meanings of (1) glyph, (2) character, (3) codepoint,
and (4) denormalized strings.

If you digest all that, you will see that the unichar type would have to be
defined very large, blowing up strings enormously, and that strings would
still have to be converted back to either UTF-16 or UTF-8 to communicate
with nearly anything (APIs, libraries, etc.).
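
A small sketch of the distinctions listed above, under the assumption that
"denormalized" refers to strings that are not in a Unicode normalization
form such as NFC. Python is used only because its standard unicodedata
module exposes normalization; this is not FPC code.

```python
import unicodedata

composed = "\u00e9"       # one precomposed code point
decomposed = "e\u0301"    # letter e followed by COMBINING ACUTE ACCENT

# Both render as the same glyph, but they are different code point sequences.
print(composed == decomposed)             # False
print(len(composed), len(decomposed))     # 1 2

# Normalizing to NFC composes the pair into the single code point.
print(unicodedata.normalize("NFC", decomposed) == composed)   # True

# One picture on screen, built from three emoji joined by ZERO WIDTH JOINERs.
family = "\U0001F468\u200D\U0001F469\u200D\U0001F467"
print(len(family))                        # 5 code points
print(len(family.encode("utf-32-le")))    # 20 bytes at four bytes per code point
```

A character type meant to hold whatever the user perceives as one character
would need room for sequences like the last one, which is the size problem
described above.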

Moreover, it will just require yet another conversion and cause more
confusion, with more competing systems. So the number of problems will only
rise. And the incompatibility with Delphi is still there, so it will create
trouble ad infinitum.

This argument is best summed up by this cartoon: https://xkcd.com/927/

In short, there is no substitute for actively learning what Unicode is
about and living with it.

Some of the problems were summed up in the discussion back then:
http://www.stack.nl/~marcov/unicode.pdf

Note that in hindsight I don't think Florian's proposal was that bad, and
Florian was somewhat vindicated by Delphi's choice of a multi-encoding
ansistring type.

My new opinion is that, whatever the choice is, choosing differently
from Delphi (despite all its flaws, perceived OR real, doesn't matter) was
wrong.
___
fpc-pascal maillist  -  fpc-pascal@lists.freepascal.org
http://lists.freepascal.org/cgi-bin/mailman/listinfo/fpc-pascal

Re: [fpc-pascal] Yet another thread on Unicode Strings

2017-10-04 Thread Mattias Gaertner
On Wed, 4 Oct 2017 13:10:02 +0100
Tony Whyman wrote:

> Unicode Character String handling is a question that keeps coming up on 
> the Free Pascal Mailing lists and, empirically, it is hard to avoid the 
> conclusion that there is something wrong with the way these character 
> string types are handled. Otherwise, why does this issue keep arising?

Mixing string types, mixing encodings, mixing legacy code, confusing
UCS-2 with UTF-16, 


>[...]
> Another problem is that there is no character type for a Unicode 
> Character.

I'm curious: What languages have such a type?

> The built-in type “WideChar” is only two bytes and cannot 
> hold a UTF-16 code point encoded as a surrogate pair. There is no 
> char type for a UTF-8 character and, while UCS4Char exists, the Lazarus 
> UTF-8 utilities use “cardinal” as the type for a code point (not exactly 
> strong typing).

Should be remedied.

>[...]
> Let the programmer worry about the algorithm and the compiler worry about the 
> best implementation.

A UTF-32 string type is seldom the best choice for memory
and/or speed.
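
To put rough numbers on the memory side of that claim, a minimal sketch
follows. The sample strings and the helper name encoded_sizes are invented
for illustration, and Python is used only as a convenient calculator; it
says nothing about how FPC stores strings internally.

```python
def encoded_sizes(text):
    """Return the payload size in bytes of text in three Unicode encodings."""
    encodings = ("utf-8", "utf-16-le", "utf-32-le")  # -le variants add no byte order mark
    return {enc: len(text.encode(enc)) for enc in encodings}


print(encoded_sizes("Free Pascal"))
# {'utf-8': 11, 'utf-16-le': 22, 'utf-32-le': 44}

print(encoded_sizes("Gr\u00fc\u00dfe"))   # German greeting with two non-ASCII letters
# {'utf-8': 7, 'utf-16-le': 10, 'utf-32-le': 20}
```

For text like this, UTF-32 takes two to four times the space of the other
encodings. Speed is a separate question that this sketch does not measure.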

>[...]
> I want to propose a new character type called “UniChar” - short for 
> Unicode Character, along with a new string type “UniString” and a new 
> collection “TUniStrings”. I have presented my thoughts here in a 
> detailed paper
> 
> see https://mwasoftware.co.uk/docs/unistringproposal.pdf
> 
> This is intended to be a fully worked proposal and I have circulated it 
> to provoke discussion and in the hope that it may be useful.

Adding another string type without disabling some old string types will
increase the confusion. Please provide a proposal for disabling old
string types.

Also keep in mind that there is still no UTF-16 RTL, even though
many people need that for Delphi compatibility. Starting yet another
RTL, this time for UTF-32, would need some heavily dedicated programmers.

Mattias

[fpc-pascal] Yet another thread on Unicode Strings

2017-10-04 Thread Tony Whyman
Unicode Character String handling is a question that keeps coming up on 
the Free Pascal Mailing lists and, empirically, it is hard to avoid the 
conclusion that there is something wrong with the way these character 
string types are handled. Otherwise, why does this issue keep arising?


Supporters of the current implementation point to the rich set of 
functions available to handle both UTF-8 and UTF-16 in addition to 
legacy ANSI code pages. That is true – but it may be that it is also the 
problem. The programmer is too often forced to be aware of how strings 
are encoded and must make a choice as to which is the preferred 
character encoding for their program. There then follows confusion over 
how to make that choice. Is Delphi compatibility the goal? What 
Languages must I support? If I want platform independence which is the 
best encoding? Which encoding gives the best performance for my 
algorithm? And so on.


Another problem is that there is no character type for a Unicode 
Character. The built-in type “WideChar” is only two bytes and cannot 
hold a UTF-16 code point encoded as a surrogate pair. There is no 
char type for a UTF-8 character and, while UCS4Char exists, the Lazarus 
UTF-8 utilities use “cardinal” as the type for a code point (not exactly 
strong typing).


In order to stop all this confusion I believe that there has to be a 
return to Pascal's original fundamental concept. That is, the value of a 
character type represents a character, while the encoding of the 
character is platform dependent and a choice the compiler makes and not 
the programmer. Likewise a character string is an array of characters 
that can be indexed by character (not byte) number, from which 
substrings can be selected and compared with other strings according to 
the locale and the unicode standard collating sequence. Let the 
programmer worry about the algorithm and the compiler worry about the 
best implementation.


I want to propose a new character type called “UniChar” - short for 
Unicode Character, along with a new string type “UniString” and a new 
collection “TUniStrings”. I have presented my thoughts here in a 
detailed paper


see https://mwasoftware.co.uk/docs/unistringproposal.pdf

This is intended to be a fully worked proposal and I have circulated it 
to provoke discussion and in the hope that it may be useful.


The intent is to create a character and string handling design that is 
natural to use with the programmer rarely if ever having to think about 
the character or string encoding. They are dealing with Unicode 
Characters and strings of Unicode Characters and that is all. When 
necessary, transliteration happens naturally and as a consequence of 
string concatenation, input/output, or in the rare cases when 
performance demands a specific character encoding.


There is also a strong desire to avoid creating more choice and hence 
more confusion. The intent is to “embrace and replace”. Both AnsiString 
and UnicodeString should be seen as subsets or special cases of the 
proposed UniString, and with concrete types such as AnsiChar, WideChar 
and WideString, other than for legacy reasons, existing primarily to 
define external interfaces.


Tony Whyman

MWA Software
