Unicode character string handling is a question that keeps coming up on the Free Pascal mailing lists and, empirically, it is hard to avoid the conclusion that there is something wrong with the way these string types are handled. Otherwise, why does the issue keep arising?

Supporters of the current implementation point to the rich set of functions available to handle both UTF-8 and UTF-16 in addition to legacy ANSI code pages. That is true, but it may also be the problem. The programmer is too often forced to be aware of how strings are encoded and must choose a preferred character encoding for their program. Confusion then follows over how to make that choice. Is Delphi compatibility the goal? Which languages must I support? If I want platform independence, which is the best encoding? Which encoding gives the best performance for my algorithm? And so on.
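
To illustrate the point, here is a minimal sketch using only standard FPC string types (the byte values for 'é' are spelled out explicitly so the example does not depend on the source file's encoding): even something as basic as what Length returns depends on which encoding the programmer happened to choose.

program EncodingChoice;
{$mode objfpc}{$H+}
var
  U8: AnsiString;      { taken here to hold UTF-8 encoded bytes }
  U16: UnicodeString;  { holds UTF-16 code units }
begin
  U8  := 'caf' + #$C3 + #$A9;                     { 'café' as UTF-8: 'é' is two bytes }
  U16 := UnicodeString('caf') + WideChar($00E9);  { 'café' as UTF-16: 'é' is one code unit }
  WriteLn(Length(U8));   { prints 5 - counts bytes }
  WriteLn(Length(U16));  { prints 4 - counts UTF-16 code units, not characters in general }
end.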

Another problem is that there is no character type for a Unicode character. The built-in type “WideChar” is only two bytes and cannot hold a code point that UTF-16 encodes as a surrogate pair (two code units). There is no char type for a UTF-8 character and, while UCS4Char exists, the Lazarus UTF-8 utilities use “cardinal” as the type for a code point (not exactly strong typing).
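
A minimal sketch of the gap, using standard FPC (U+1F600 is just an arbitrary example of a character outside the Basic Multilingual Plane):

program NonBmpChar;
{$mode objfpc}{$H+}
var
  S: UnicodeString;
  CP: UCS4Char;
begin
  S := #$D83D#$DE00;        { U+1F600 encoded in UTF-16 as a surrogate pair }
  WriteLn(Length(S));       { prints 2 - one character, but two WideChar code units }
  CP := $1F600;             { the whole code point fits in a single UCS4Char }
  WriteLn(LongWord(CP));    { prints 128512 }
end.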

In order to stop all this confusion I believe that there has to be a return to Pascal's original fundamental concept: the value of a character type represents a character, while the encoding of that character is platform dependent and a choice made by the compiler, not the programmer. Likewise, a character string is an array of characters that can be indexed by character (not byte) number, from which substrings can be selected, and which can be compared with other strings according to the locale and the Unicode standard collating sequence. Let the programmer worry about the algorithm and the compiler worry about the best implementation.
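
The gap between that concept and today's behaviour is easy to show (again a minimal sketch, with the 'é' bytes written out explicitly and the string assumed to hold UTF-8):

program ByteVsCharIndex;
{$mode objfpc}{$H+}
var
  S: AnsiString;
begin
  S := 'caf' + #$C3 + #$A9;  { 'café' stored as UTF-8 bytes }
  WriteLn(Ord(S[4]));        { prints 195 - S[4] is only the lead byte of 'é' }
  WriteLn(Copy(S, 4, 1));    { an invalid one-byte fragment, not the character 'é' }
end.

A string type indexed by character would instead make S[4] the character 'é' and Copy(S, 4, 1) a valid one-character substring.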

I want to propose a new character type called “UniChar”, short for Unicode Character, along with a new string type “UniString” and a new collection “TUniStrings”. I have presented my thoughts in a detailed paper:

see https://mwasoftware.co.uk/docs/unistringproposal.pdf

This is intended to be a fully worked proposal. I have circulated it to provoke discussion and in the hope that it may prove useful.

The intent is to create a character and string handling design that is natural to use, with the programmer rarely, if ever, having to think about the character or string encoding. They are dealing with Unicode characters and strings of Unicode characters, and that is all. When necessary, transcoding between encodings happens naturally, as a consequence of string concatenation, input/output, or the rare cases when performance demands a specific character encoding.
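
Purely as an illustration of that intent (this is not compilable code: UniString and UniChar are the proposed types, and the identifiers SomeUtf8Source, SomeUtf16Source and CallUtf16Api are placeholders of my own; the behaviour shown is my reading of the description above, not an API defined in the paper or in any existing FPC unit):

var
  Line: UniString;
  C: UniChar;
begin
  Line := SomeUtf8Source;          { arrives UTF-8 encoded; stored however the compiler sees fit }
  C := Line[1];                    { indexed by character, never by byte or code unit }
  if Line = SomeUtf16Source then   { compared by character, regardless of original encodings }
    CallUtf16Api(Line);            { any transcoding needed by the interface happens implicitly }
end;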

There is also a strong desire to avoid creating more choice and hence more confusion. The intent is to “embrace and replace”. Both AnsiString and UnicodeString should be seen as subsets or special cases of the proposed UniString, with concrete types such as AnsiChar, WideChar and WideString existing, other than for legacy reasons, primarily to define external interfaces.

Tony Whyman

MWA Software
